Signboard recognition method and device based on multi-network layer loss fusion

By using a multi-layer loss fusion method, an N-layer neural network is constructed and cross-semantic information fusion and loss function optimization are performed, which solves the problem of inaccurate image feature extraction in high-precision maps and improves mapping accuracy.

CN116152780B · Active · Publication Date: 2026-06-16 · ZHIDAO NETWORK TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHIDAO NETWORK TECH (BEIJING) CO LTD
Filing Date: 2023-02-17
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, smoothing post-processing algorithms have low accuracy in image feature extraction during high-precision map production, which cannot meet the needs of high-precision map recognition tasks.

Method used

We employ a multi-layer loss fusion approach to construct an N-layer neural network. By fusing input and output feature maps through cross-semantic information, we establish loss functions for different network layers and optimize the neural network parameters.

Benefits of technology

This improves the accuracy of neural networks in extracting image features and enhances the mapping precision of high-precision maps.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to a signboard recognition method and device based on multi-network layer loss fusion. The method comprises the following steps: acquiring a to-be-recognized image; and predicting a position region corresponding to a signboard in the to-be-recognized image according to a preset prediction model. The construction method of the prediction model comprises the following steps: constructing a neural network of N network layers; in the process of acquiring the input feature maps and output feature maps of the 2nd to N-1th network layers of the neural network, performing cross-semantic information fusion on the input feature maps and the output feature maps of different network layers; acquiring a plurality of loss values according to the output feature maps of the 1st to N-1th network layers of the neural network and label image data; and adjusting the parameters of the neural network according to the plurality of loss values. According to the scheme provided by the application, different hierarchical semantic features of the input feature map and the output feature map are fused, different loss functions are established to optimize the neural network, and the accuracy of image feature extraction is improved.

Description

Technical Field

[0001] This application relates to the field of high-precision map technology, and in particular to a road sign recognition method and apparatus based on multi-network layer loss fusion.

Background Technology

[0002] During the production of high-precision maps, smoothing post-processing algorithms are generally used to optimize the images in order to improve their display quality.

[0003] In related technologies, smoothing post-processing algorithms often employ algorithms such as Hr-net and bis-net. However, these algorithms have low accuracy in extracting image features and cannot complete high-precision map image recognition tasks.

Summary of the Invention

[0004] To address or partially address the problems existing in related technologies, this application provides a road sign recognition method and apparatus based on multi-network layer loss fusion. This method can fuse semantic features at different levels of the input feature map and the output feature map, and establish loss functions corresponding to different network layers to optimize the neural network, effectively improving the accuracy of the neural network in extracting image features and further improving the mapping accuracy of high-precision maps.

[0005] The first aspect of this application provides a road sign recognition method based on multi-network layer loss fusion, including:

[0006] Acquire the image to be recognized;

[0007] The location region corresponding to the road sign in the image to be identified is predicted according to a preset prediction model;

[0008] The method for constructing the prediction model includes:

[0009] Construct a neural network with N layers;

[0010] In the process of obtaining the input feature maps and output feature maps of the 2nd to N-1th layers of the neural network, the input feature maps and output feature maps of different layers of the network are fused with cross-semantic information.

[0011] Based on the output feature maps and label image data of the 1st to N-1th layers of the neural network, obtain multiple loss values corresponding to the 1st to N-1th layers of the network.

[0012] The parameters of the neural network are adjusted according to multiple loss values until the loss values meet preset conditions, thus obtaining a trained neural network.

[0013] Where N is a positive integer greater than or equal to 4.

[0014] In some embodiments, constructing an N-layer neural network includes:

[0015] A neural network with N layers is constructed. Each layer has an encoder for acquiring an input feature map and a decoder for acquiring an output feature map corresponding to the input feature map. The encoders of layers 1 to N perform downsampling operations from top to bottom to acquire the input feature map, and the decoders of layers 1 to N perform upsampling operations from bottom to top to acquire the output feature map.

[0016] In some embodiments, the downsampling operation and the upsampling operation use the same sampling rate.

[0017] In some embodiments, during the process of determining the input feature maps and output feature maps of layers 2 to N-1 of the neural network, cross-semantic information fusion is performed on the input feature maps and output feature maps of different network layers, including:

[0018] In the process of determining the input feature maps and output feature maps of layers 2 to N-1 of the neural network, the input feature map of layer n and the output feature map of layer n+1 are fused with semantic information, and the output feature map of layer n and the input feature map of layer n+1 are fused with semantic information.

[0019] Where n is a positive integer greater than or equal to 2, and n is less than or equal to N-2.

[0020] In some embodiments, the input feature map of the n-layer network layer and the output feature map of the n+1-layer network layer are fused with semantic information, including:

[0021] The first bottom-level feature map corresponding to the input feature map of the n-layer network is extracted by using a set convolution operation.

[0022] The first bottom-level feature map is interpolated according to a set interpolation operation to obtain a second bottom-level feature map corresponding to the size of the high-level feature map.

[0023] The second bottom-level feature map is fused with the output feature map of the n+1 layer to obtain the output feature map after semantic information fusion.

[0024] In some embodiments, the output feature map of the n-layer network is semantically fused with the input feature map of the n+1-layer network, including:

[0025] The first high-level feature map corresponding to the output feature map of the n-layer network is extracted by using a set convolution operation.

[0026] The first high-level feature map is interpolated according to the set interpolation operation to obtain a second high-level feature map corresponding to the size of the bottom feature map.

[0027] The second high-level feature map is fused with the input feature map of layer n+1 to obtain the output feature map after semantic information fusion.

[0028] In some embodiments, obtaining multiple loss values corresponding to layers 1 to N-1 of the neural network based on the output feature maps and label image data of layers 1 to N-1 includes:

[0029] Based on the output feature maps and label image data after semantic information fusion of layers 1 to N-1 of the neural network, multiple loss values corresponding to layers 1 to N-1 are obtained.

[0030] In some embodiments, adjusting the parameters of the neural network based on a plurality of loss values until the loss values meet preset conditions to obtain a trained neural network includes:

[0031] Based on the preset weight coefficients and multiple loss values, the total loss value corresponding to the neural network is obtained;

[0032] Backpropagation is performed based on the total loss value to adjust the parameters of the neural network until the total loss value is less than a preset threshold, thus obtaining a trained neural network.

[0033] A second aspect of this application provides a road sign recognition device based on multi-network layer loss fusion, comprising:

[0034] The acquisition module is used to acquire the image to be recognized;

[0035] The prediction module is used to predict the location region corresponding to the road sign in the image to be identified based on a preset prediction model.

[0036] The prediction module includes:

[0037] The building block is used to construct a neural network with N layers; where N is a positive integer greater than or equal to 4.

[0038] The fusion module is used to perform cross-semantic information fusion on the input feature maps and output feature maps of different network layers during the acquisition of input feature maps and output feature maps of layers 2 to N-1 of the neural network.

[0039] The loss module is used to obtain multiple loss values corresponding to the 1st to N-1st network layers based on the output feature maps and label image data of the 1st to N-1st network layers of the neural network obtained by the construction module.

[0040] An adjustment module is used to adjust the parameters of the neural network according to multiple loss values obtained by the loss module until the loss values meet preset conditions, thereby obtaining a trained neural network.

[0041] A third aspect of this application provides an electronic device comprising:

[0042] Processor; and

[0043] A memory that stores executable code, which, when executed by the processor, causes the processor to perform the method described above.

[0044] A fourth aspect of this application provides a computer-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the method described above.

[0045] The technical solution provided in this application may include the following beneficial effects:

[0046] The technical solution of this application, in the process of obtaining the input feature maps and output feature maps of the 2nd to N-1th layers of the neural network, performs cross-semantic information fusion on the input feature maps and output feature maps of different layers of the neural network, and obtains multiple loss values corresponding to the 1st to N-1th layers of the neural network based on the output feature maps and label image data, thereby realizing the fusion of semantic features of different levels of input feature maps and output feature maps, and establishing loss functions corresponding to different layers of the neural network to optimize the neural network, effectively improving the accuracy of the neural network in extracting image features, and further improving the mapping accuracy of high-precision maps.

[0047] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0048] The above and other objects, features and advantages of this application will become more apparent from the more detailed description of exemplary embodiments thereof in conjunction with the accompanying drawings, wherein the same reference numerals generally represent the same components in the exemplary embodiments thereof.

[0049] Figure 1 is a schematic flowchart of the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application;

[0050] Figure 2 is another schematic diagram of the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application;

[0051] Figure 3 is another schematic diagram of the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application;

[0052] Figure 4 is a schematic diagram illustrating the application of the traffic light and road sign recognition method based on multi-task traffic diversion in an embodiment of this application;

[0053] Figure 5 is another schematic diagram of the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application;

[0054] Figure 6 is another schematic diagram of the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application;

[0055] Figure 7 is a diagram illustrating the effect of the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application;

[0056] Figure 8 is a schematic diagram of the structure of a road sign recognition device based on multi-network layer loss fusion shown in an embodiment of this application;

[0057] Figure 9 is a schematic diagram of the structure of an electronic device shown in an embodiment of this application.

Detailed Implementation

[0058] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. While embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to make this application more thorough and complete, and to fully convey the scope of this application to those skilled in the art.

[0059] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0060] It should be understood that although the terms "first," "second," "third," etc., may be used in this application to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0061] In related technologies, when using smoothing post-processing algorithms to process images, the accuracy of extracting image features is low, making it impossible to complete high-precision map image recognition tasks.

[0062] To address the aforementioned issues, this application provides a method for constructing a neural network based on multi-layer loss fusion. This method can fuse semantic features from different levels of the input and output feature maps, and establish loss functions corresponding to different network layers to optimize the neural network. This effectively improves the accuracy of the neural network in extracting image features and further enhances the mapping accuracy of high-precision maps.

[0063] The technical solutions of the embodiments of this application are described in detail below with reference to the accompanying drawings.

[0064] Figure 1 is a flowchart illustrating a road sign recognition method based on multi-network layer loss fusion, as shown in an embodiment of this application. Figure 2 is another flowchart illustrating the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application.

[0065] Referring to Figure 1 and Figure 2, the road sign recognition method based on multi-network layer loss fusion in this application includes:

[0066] S110, acquire the image to be recognized.

[0067] In this step, the image to be identified can be acquired as input data for the prediction model.

[0068] The image to be recognized can refer to information about the vehicle's surrounding environment collected by the vehicle-mounted image acquisition device. The image to be recognized can be in picture format or video format. It should be understood that when the input is in video format, the image data is obtained by extracting frames from the video, and the frame extraction interval can be set according to usage requirements.
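Frame extraction at a configurable interval can be sketched as follows. This is a minimal illustration only: the function name `sample_frame_indices` and the convention of keeping the first frame of every interval are assumptions, not details given in the application.

```python
def sample_frame_indices(total_frames: int, interval: int) -> list[int]:
    """Return the indices of the frames kept when one frame is taken every `interval` frames."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Keep frame 0, then every `interval`-th frame after it (assumed convention).
    return list(range(0, total_frames, interval))


# A 10-frame sequence sampled every 3 frames keeps frames 0, 3, 6 and 9.
indices = sample_frame_indices(total_frames=10, interval=3)
print(indices)  # [0, 3, 6, 9]
```

A larger interval yields fewer images per recording, which is the trade-off the "usage requirements" would decide.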

[0069] Here, the vehicle-mounted terminal can be an intelligent device such as an in-vehicle host or intelligent control equipment. It can also be an in-vehicle environmental information collection system, that is, an intelligent system used to collect information about the environment around the vehicle. Such information may include, but is not limited to, environmental information around the vehicle, road information, road sign information, and traffic light information.

[0070] S120, predict the location region in the image to be identified that corresponds to the road sign based on the preset prediction model.

[0071] In this step, a preset prediction model is used to predict the location region in the image to be recognized that corresponds to the road sign. For example, if the image to be recognized contains three road signs, then after the image is input into the prediction model, the location information of the three road signs can be obtained, such as their coordinates or location regions in the image to be recognized. The method for constructing the prediction model includes:

[0072] S121, Construct an N-layer neural network.

[0073] In this step, N is a positive integer greater than or equal to 4, meaning the neural network has at least 4 basic network layers. Each basic network layer can obtain an input feature map and a corresponding output feature map.

[0074] In some embodiments, the input image data can be 3-channel image data with a resolution of 480×800. Of course, the image data can also be image data of other resolutions, such as 1920×1080 resolution image data. In other embodiments, the input image data can be image data converted to a preset format after resolution conversion. It is understood that the input image data can be image data in different resolution formats, and the image data can be converted to the corresponding preset resolution format before being input. For example, if the input image data is 1920×1080 resolution image data, this image data can be converted to 480×800 resolution image data before input.

[0075] S122, In the process of acquiring the input feature maps and output feature maps of the 2nd to N-1th layers of the neural network, the input feature maps and output feature maps of different layers of the network are fused with cross-semantic information.

[0076] In this step, cross-semantic information fusion is used to enable the neural network to make full use of high-level and low-level semantics, thereby improving the recognition accuracy of the neural network in recognizing images.

[0077] S123, Based on the output feature maps and label image data of the 1st to N-1th layers of the neural network, obtain multiple loss values corresponding to the 1st to N-1th layers of the neural network.

[0078] In this step, a loss function is established based on the output feature maps and label image data obtained from layers 1 to N-1 of the neural network, and multiple loss values corresponding to layers 1 to N-1 are obtained.

[0079] S124: Adjust the parameters of the neural network according to multiple loss values until the loss value meets the preset conditions, and obtain the trained neural network.

[0080] In this step, depending on whether the obtained loss values meet the preset conditions, backpropagation is performed through the neural network using the multiple loss values to adjust and optimize its parameters.

[0081] The preset condition can be whether the total loss value derived from the multiple loss values is less than a preset threshold. The total loss value can be obtained based on preset weighting coefficients and the multiple loss values.
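The stopping rule described here (keep adjusting parameters until the weighted total loss falls below a preset threshold) can be illustrated with a toy loop. Everything in this sketch is an assumption made for illustration: a single scalar parameter stands in for the network, each "layer loss" is a squared error, plain gradient descent replaces backpropagation through a real network, and the `max_steps` guard is added so the loop always terminates.

```python
def train_until_threshold(param, targets, weights, threshold, lr=0.1, max_steps=1000):
    """Toy training loop: stop once the weighted total loss is below `threshold`."""
    total = float("inf")
    for step in range(max_steps):
        # One stand-in loss per supervised layer (hypothetical squared error).
        losses = [(param - target) ** 2 for target in targets]
        total = sum(w * loss for w, loss in zip(weights, losses))
        if total < threshold:  # preset condition met: training stops
            return param, total, step
        # Gradient of the weighted total loss with respect to the parameter.
        grad = sum(w * 2.0 * (param - target) for w, target in zip(weights, targets))
        param -= lr * grad
    return param, total, max_steps


param, total, steps = train_until_threshold(
    param=0.0, targets=[1.0, 1.0], weights=[0.5, 0.5], threshold=1e-4
)
print(f"stopped after {steps} steps with total loss {total:.2e}")
```

In a real network the parameter update would come from an optimizer acting on all weights, but the control flow (compute per-layer losses, combine them, compare with the threshold, update) is the same.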

[0082] In this embodiment, the method of this application, during the construction of the prediction model, in the process of acquiring the input feature maps and output feature maps of different network layers 2 to N-1, performs cross-semantic information fusion on the input feature maps and output feature maps of different network layers. Furthermore, based on the output feature maps of the network layers 1 to N-1 and the label image data, multiple loss values corresponding to the network layers 1 to N-1 are obtained respectively. This achieves the fusion of semantic features at different levels of the input feature maps and output feature maps, and establishes loss functions corresponding to different network layers to optimize the neural network, effectively improving the accuracy of the neural network in extracting image features and further improving the mapping accuracy of high-precision maps.

[0083] Figure 3 is another flowchart illustrating the road sign recognition method based on multi-network layer loss fusion shown in the embodiments of this application. Figure 4 is a schematic diagram illustrating the application of the traffic light and road sign recognition method based on multi-task traffic splitting, as shown in an embodiment of this application. In the figures, encoder denotes the encoder; decoder denotes the decoder; concat is a 1x1 concatenation operation; conv is a 3x3 convolution operation with padding of 1 and stride of 1; input is the input; output is the output; total loss is the total loss function; loss_ce is the cross-entropy loss; w1, w2, w3, and w4 are the weight coefficients of the four loss functions, respectively; and l1, l2, l3, and l4 represent loss_ce1, loss_ce2, loss_ce3, and loss_ce4, respectively.

[0084] Referring to Figure 3 and Figure 4, the road sign recognition method based on multi-network layer loss fusion in this application includes:

[0085] S210, construct an N-layer neural network, where each layer has an encoder part for acquiring the input feature map and a decoder part for acquiring the output feature map corresponding to the input feature map.

[0086] An N-layer neural network is constructed, where each layer has an encoder and a decoder. The encoder acquires the input image data and then extracts an input feature map. The decoder, based on the input feature map from the encoder, extracts an output feature map corresponding to the input feature map.

[0087] In some embodiments, the encoder portions of layers 1 to N perform downsampling operations from top to bottom to obtain input feature maps. Specifically, the encoding portion of layer 1 can use convolution to obtain the input feature map based on the input image. It can be understood that the input feature maps of the encoder portions of layers 2 to N are all obtained by downsampling the input feature map of the previous layer.

[0088] In some embodiments, the decoder portions of layers 1 to N perform upsampling operations from bottom to top to obtain output feature maps. Specifically, the decoder portion of layer N can obtain its output feature map by applying a 1x1 convolution operation to the input feature map of the encoder portion of layer N. It can be understood that, after the decoder portion of layer N obtains its output feature map, the output feature maps of the decoder portions of layers 1 to N-1 are all obtained by upsampling the output feature map of the next layer.

[0089] In some embodiments, the downsampling and upsampling operations of this application use the same sampling ratio. For example, the upsampling and downsampling operations use a ratio of 2. In this way, each downsampling operation doubles the number of channels in the feature map and halves the image width and height, and each upsampling operation halves the number of channels in the feature map and doubles the image width and height.
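The shape arithmetic implied by a sampling ratio of 2 can be traced with a short sketch. The starting channel count of 16, the reading of 480×800 as height × width, and the helper names are assumptions for illustration; channel changes caused by concatenation in the fusion steps are ignored here.

```python
def encoder_shapes(channels, height, width, num_layers, ratio=2):
    """Shapes (C, H, W) of the encoder input feature maps for layers 1..N."""
    shapes = [(channels, height, width)]
    for _ in range(num_layers - 1):
        # Downsampling: channels multiplied by the ratio, height and width divided by it.
        channels, height, width = channels * ratio, height // ratio, width // ratio
        shapes.append((channels, height, width))
    return shapes


def decoder_shapes(deepest_shape, num_layers, ratio=2):
    """Shapes (C, H, W) of the decoder output feature maps for layers 1..N."""
    channels, height, width = deepest_shape
    shapes = [(channels, height, width)]
    for _ in range(num_layers - 1):
        # Upsampling: channels divided by the ratio, height and width multiplied by it.
        channels, height, width = channels // ratio, height * ratio, width * ratio
        shapes.append((channels, height, width))
    return shapes[::-1]  # reorder from layer 1 (top) to layer N (bottom)


enc = encoder_shapes(channels=16, height=480, width=800, num_layers=4)
dec = decoder_shapes(enc[-1], num_layers=4)
for layer, (e, d) in enumerate(zip(enc, dec), start=1):
    print(f"layer {layer}: encoder {e}, decoder {d}")
```

Because both paths use the same ratio, the encoder and decoder feature maps of a given layer have matching sizes, which is what makes the lateral fusion between the two paths straightforward.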

[0090] In some embodiments, feature map fusion can be achieved between the encoder and decoder parts using concatenation and / or convolution operations. It is understood that feature map fusion can be performed based on the input feature map of the encoder part using concatenation and / or convolution operations to obtain the output feature map. For example, after the encoder part obtains the input feature map, it uses concatenation and convolution operations to fuse the output feature map of the next network layer with the input feature map, thereby obtaining the output feature map of the current layer's decoder part. It should be understood that after each convolution operation, the obtained feature map needs to be normalized and activated, for example, by sequentially normalizing and activating the feature map after the convolution operation using a conventional normalization function and activation function.

[0091] By employing downsampling in the encoder to form a top-down feature map acquisition path, and upsampling in the decoder to form a bottom-up feature map acquisition path, the neural network can form two corresponding feature pyramids through these two different feature map paths. Feature map fusion can be achieved between the encoder and decoder through concatenation and / or convolution operations, i.e., lateral connection between the two paths. Through the neural network structure described above, both low-level and high-level semantic information can be fully utilized, thereby improving the overall prediction performance of the neural network.

[0092] S220, during the process of determining the input feature maps and output feature maps of layers 2 to N-1 of the neural network, semantic information can be fused between the input feature map of layer n and the output feature map of layer n+1, and semantic information can be fused between the output feature map of layer n and the input feature map of layer n+1.

[0093] Cross-semantic information fusion refers to the semantic fusion of the input and output feature maps of two adjacent network layers with the output and input feature maps of the next layer, respectively. It can be understood as inputting the input feature map of the previous layer into the next layer and semantically fusing it with the output feature map of that layer; this fusion process integrates lower-level semantic features into the output feature map. Similarly, it can be understood as inputting the output feature map of the previous layer into the next layer and semantically fusing it with the input feature map of that layer; this fusion process integrates higher-level semantic features into the input feature map.

[0094] Where n is a positive integer greater than or equal to 2, and n is less than or equal to N-2. For example, if N is 5, n takes the values 2 and 3: the second and third network layers perform cross-semantic information fusion, and the third and fourth network layers also perform cross-semantic information fusion. In other words, the input and output feature maps of the second network layer are semantically fused with the output and input feature maps of the third network layer, respectively, and the input and output feature maps of the third network layer are semantically fused with the output and input feature maps of the fourth network layer, respectively.
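The layer pairs involved in cross-semantic fusion follow directly from the range 2 ≤ n ≤ N-2. A small sketch that enumerates them is shown below; the function name and the string labels are purely illustrative.

```python
def cross_fusion_pairs(num_layers: int) -> list[tuple[str, str]]:
    """List the (source, destination) feature-map pairs fused for an N-layer network."""
    if num_layers < 4:
        raise ValueError("N must be at least 4")
    pairs = []
    for n in range(2, num_layers - 1):  # n = 2, ..., N-2
        # Input of layer n is fused with the output of layer n+1.
        pairs.append((f"input[{n}]", f"output[{n + 1}]"))
        # Output of layer n is fused with the input of layer n+1.
        pairs.append((f"output[{n}]", f"input[{n + 1}]"))
    return pairs


for source, destination in cross_fusion_pairs(5):
    print(f"{source} -> {destination}")
```

For N = 5 this lists four fusions, covering layers 2 through 4; layers 1 and N take no part in the cross fusion.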

[0095] Figure 5 This is another flowchart illustrating the road sign recognition method based on multi-network layer loss fusion as shown in the embodiments of this application.

[0096] Referring also to Figure 5, in some embodiments, semantic information fusion of the input feature map of the n-layer network and the output feature map of the n+1-layer network may include the following steps:

[0097] S2211, using a set convolution operation to extract the first bottom-level feature map corresponding to the input feature map of the n-layer network layer.

[0098] The first bottom-level feature map corresponding to the input feature map of the nth network layer is extracted through a convolution operation. The first bottom-level feature map corresponds to the low-level semantic features. The set convolution operation can be a 3*3 convolution with padding of 1 and a stride of 1.

[0099] S2212, interpolate the first bottom-level feature map according to the set interpolation operation to obtain the second bottom-level feature map corresponding to the size of the high-level feature map.

[0100] The acquired first bottom-level feature map is interpolated to obtain a second bottom-level feature map matching the size of the output feature map of layer n+1. The interpolation operation can be bilinear interpolation.

[0101] S2213, perform channel fusion of the second bottom-level feature map and the output feature map of layer n+1 to obtain the output feature map after semantic information fusion.

[0102] The second bottom-level feature map obtained through the convolution and interpolation operations is channel-fused with the corresponding output feature map to obtain the output feature map of layer n+1 after semantic information fusion.
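Steps S2211 to S2213 can be sketched in plain Python on small nested lists. This is an illustrative sketch under several assumptions: feature maps are lists of 2D channels, the convolution acts on a single channel with a hypothetical kernel, the bilinear interpolation uses an align-corners mapping (the application does not specify one), and channel fusion is read as concatenation along the channel axis. Normalization and activation are omitted.

```python
def conv3x3_same(channel, kernel):
    """3x3 convolution with zero padding 1 and stride 1; output size equals input size."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:  # positions outside are zero padding
                        acc += kernel[di][dj] * channel[y][x]
            out[i][j] = acc
    return out


def bilinear_resize(channel, out_h, out_w):
    """Bilinear interpolation to (out_h, out_w), align-corners convention (assumed)."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = channel[y0][x0] * (1 - fx) + channel[y0][x1] * fx
            bottom = channel[y1][x0] * (1 - fx) + channel[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def fuse_channels(maps_a, maps_b):
    """Channel fusion read as concatenation of two channel lists of equal spatial size."""
    return maps_a + maps_b


# Layer n input feature map: one 4x4 channel. Layer n+1 output feature map: two 2x2 channels.
input_n = [[float(4 * r + c) for c in range(4)] for r in range(4)]
output_n1 = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # hypothetical kernel

first_bottom = conv3x3_same(input_n, identity)   # S2211
second_bottom = bilinear_resize(first_bottom, 2, 2)  # S2212
fused_output = fuse_channels([second_bottom], output_n1)  # S2213
print(len(fused_output), "channels of size 2x2")
```

With a 3×3 kernel, padding 1 and stride 1, the convolution preserves the spatial size, so only the interpolation step changes the resolution before the channels are joined.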

[0103] Figure 6 This is another flowchart illustrating the road sign recognition method based on multi-network layer loss fusion as shown in the embodiments of this application.

[0104] Referring also to Figure 6, in some embodiments, the semantic information fusion of the output feature map of the nth network layer and the input feature map of the n+1th layer may include the following steps:

[0105] S2221, using a set convolution operation to extract the first high-level feature map corresponding to the output feature map of the n-layer network layer.

[0106] The first high-level feature map corresponds to the high-level semantic features. The first high-level feature map corresponding to the output feature map of the n-layer network is extracted through a convolution operation. The set convolution operation can be a 3*3 convolution with padding of 1 and a stride of 1.

[0107] S2222: Interpolate the first high-level feature map according to a set interpolation operation to obtain a second high-level feature map matching the size of the bottom-layer feature map.

[0108] The acquired first high-level feature map is interpolated to obtain a second high-level feature map whose size matches that of the input feature map of layer n+1. The interpolation operation may be nearest-neighbor interpolation.

[0109] S2223: Perform channel fusion of the second high-level feature map with the input feature map of layer n+1 to obtain the output feature map after semantic information fusion.

[0110] The second high-level feature map obtained through the convolution and interpolation operations is channel-fused with the corresponding input feature map to obtain the input feature map of layer n+1 after semantic information fusion.
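Steps S2222 and S2223 can be sketched as follows. This is an illustration under assumptions, not the implementation of this application: `nearest_resize` and `fuse_high_into_input` are hypothetical names, the index mapping is one common nearest-neighbor convention, and channel fusion is again assumed to be concatenation along the channel axis.

```python
def nearest_resize(channel, out_h, out_w):
    """Resize one 2-D channel by nearest-neighbor interpolation.

    Assumption: each output index maps to floor(index * in_size / out_size).
    """
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def fuse_high_into_input(high_level, input_next):
    """Fuse a high-level map into the input map of layer n+1.

    Both arguments are feature maps stored as [channels][height][width].
    """
    out_h, out_w = len(input_next[0]), len(input_next[0][0])
    resized = [nearest_resize(ch, out_h, out_w) for ch in high_level]
    return resized + input_next  # concatenate along the channel axis


# Hypothetical shapes: a 1 x 4 x 4 high-level map and a 1 x 2 x 2 input map.
high = [[[r * 4 + c for c in range(4)] for r in range(4)]]
inp_next = [[[7, 7], [7, 7]]]
fused_input = fuse_high_into_input(high, inp_next)
```

Nearest-neighbor interpolation copies existing values instead of blending them, so the resized map contains only values already present in the first high-level feature map.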

[0111] S230: Based on the output feature maps after semantic information fusion of layers 1 to N-1 of the neural network and the label image data, obtain multiple loss values respectively corresponding to layers 1 to N-1 of the neural network.

[0112] Based on the output feature maps obtained after semantic information fusion in layers 1 to N-1 of the neural network and the label image data, multiple loss values corresponding to layers 1 to N-1 are obtained by establishing loss functions for the different network layers. The encoder and decoder parts of layer N can transfer feature maps via a 1*1 convolution. That is, the encoder part of layer N passes the input feature map it obtains to the decoder part through a 1*1 convolution, so that the decoder part obtains an output feature map corresponding to the input feature map.

[0113] It should be noted that the process of establishing the loss function based on the output feature map after semantic information fusion and the label image data requires first scaling the output feature map to the size of the target image. For example, bilinear interpolation can be used to scale the output feature map after semantic information fusion to the same size as the target image. This ensures that the output feature map corresponds to the label image.

[0114] In some embodiments, the loss function may be the cross-entropy loss function, and the loss value corresponds to the cross-entropy loss value.
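As an illustration of a per-layer cross-entropy loss value, the sketch below computes the mean pixel-wise cross-entropy between a map of class scores and a label image. It assumes the output feature map has already been scaled to the label image size and holds one score per class per pixel. The function name and the averaging over pixels are assumptions, not details stated in this application.

```python
import math


def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: class scores stored as [classes][height][width]
    labels: integer class ids stored as [height][width]
    """
    num_classes = len(logits)
    height, width = len(labels), len(labels[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            scores = [logits[c][i][j] for c in range(num_classes)]
            peak = max(scores)
            # Log-sum-exp with the maximum subtracted for numerical stability.
            log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += log_sum_exp - scores[labels[i][j]]
    return total / (height * width)


# Hypothetical 2-class example on a 1 x 2 image.
uniform_logits = [[[0.0, 0.0]], [[0.0, 0.0]]]
confident_logits = [[[10.0, 0.0]], [[0.0, 10.0]]]
label_image = [[0, 1]]

uniform_loss = pixelwise_cross_entropy(uniform_logits, label_image)
confident_loss = pixelwise_cross_entropy(confident_logits, label_image)
```

With uniform scores over two classes the loss equals ln 2, and it approaches zero as the scores favor the labeled class, which is the behavior that makes the loss usable as a training signal for each layer.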

[0115] S240: Based on preset weight coefficients and multiple loss values, obtain the total loss value corresponding to the neural network.

[0116] Preset weight coefficients correspond to multiple loss values. Based on these preset weight coefficients and multiple loss values, a total loss function is established. This total loss function is used to obtain the total loss value corresponding to the neural network. For example, if the total loss is "Total loss" and the loss values are loss_ce1, loss_ce2, loss_ce3, and loss_ce4, the preset weights can be w1, w2, w3, and w4, where w1, w2, w3, and w4 correspond to loss_ce1, loss_ce2, loss_ce3, and loss_ce4, respectively. The total loss value can be the sum of w1*loss_ce1, w2*loss_ce2, w3*loss_ce3, and w4*loss_ce4.

[0117] For example, if loss_ce1, loss_ce2, loss_ce3, and loss_ce4 are represented by l1, l2, l3, and l4 respectively, the total loss function can be established as: Total loss = w1*l1 + w2*l2 + w3*l3 + w4*l4.
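The weighted sum above can be written as a short function. The name `total_loss` and the numeric loss values are hypothetical and serve only to show the arithmetic; the equal weights are one illustrative choice.

```python
def total_loss(losses, weights):
    """Weighted sum: w1*l1 + w2*l2 + ... for matching lists of losses and weights."""
    if len(losses) != len(weights):
        raise ValueError("each loss value needs exactly one weight coefficient")
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical loss values l1..l4 with equal weights w1..w4.
example_losses = [1.0, 2.0, 3.0, 4.0]
example_weights = [0.25, 0.25, 0.25, 0.25]
example_total = total_loss(example_losses, example_weights)
```

With these inputs the total is 0.25*1.0 + 0.25*2.0 + 0.25*3.0 + 0.25*4.0 = 2.5.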

[0118] S250: Perform backpropagation based on the total loss value to adjust the parameters of the neural network until the total loss value is less than a preset threshold, thereby obtaining a trained neural network.

[0119] It is determined whether the total loss value is less than the preset threshold. While it is not, backpropagation is performed on the neural network based on the total loss value to adjust and optimize the parameters of the neural network. Specifically, the weight coefficients used in the convolution calculations of the neural network can be adjusted and optimized.

[0120] The preset threshold can be set according to the user's actual needs. When the total loss value is less than the preset threshold, the total loss value is considered to have converged, and the adjustment and optimization of the neural network stops, resulting in a trained neural network.
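The stopping rule can be illustrated with a toy example. The sketch below is not a neural network: it replaces the network with a single scalar parameter `theta`, replaces backpropagation with an analytic gradient, and uses four hypothetical quadratic losses. All names, the learning rate, and the threshold value are assumptions chosen for illustration.

```python
def weighted_total(losses, weights):
    """Total loss as the weighted sum of the per-layer loss values."""
    return sum(w * l for w, l in zip(weights, losses))


def train_until_threshold(theta, weights, scales, target,
                          lr=0.1, threshold=1e-6, max_steps=10_000):
    """Gradient descent that stops once the total loss drops below the threshold.

    Toy stand-in: loss k is scales[k] * (theta - target)**2, so its gradient
    is known in closed form and no real backpropagation is needed.
    """
    for step in range(max_steps):
        losses = [s * (theta - target) ** 2 for s in scales]
        total = weighted_total(losses, weights)
        if total < threshold:
            return theta, total, step
        grad = sum(w * s * 2.0 * (theta - target) for w, s in zip(weights, scales))
        theta -= lr * grad
    # Step budget exhausted: report the loss at the final parameter value.
    losses = [s * (theta - target) ** 2 for s in scales]
    return theta, weighted_total(losses, weights), max_steps


final_theta, final_total, steps_used = train_until_threshold(
    theta=0.0, weights=[0.25] * 4, scales=[1.0, 2.0, 3.0, 4.0], target=3.0
)
```

The `max_steps` bound is a practical safeguard added in this sketch so that the loop terminates even if the threshold is never reached.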

[0121] The preset weight coefficients for the multiple loss values may be assigned equal weights in the initial calculation. For example, if there are four loss values, each of the four weight coefficients used to form the total loss value may be 0.25. When backpropagation is performed on the neural network based on the total loss value, the preset weight coefficients for the multiple loss values may also be automatically optimized and adjusted.

[0122] Figure 7 is a diagram illustrating the effect of the road sign recognition method based on multi-network layer loss fusion according to an embodiment of this application.

[0123] Referring to Figure 7, the first image shows the result without the technical solution of this application, and the second image shows the result with it. After the image data is processed with the technical solution of this application, the edges of the traffic light recognition areas at the four rectangular markers are significantly improved: they are smoother and more accurate in the second image than in the first.

[0124] In this embodiment, the method of this application employs downsampling in the encoder part of the neural network and upsampling in the decoder part, so that the neural network forms two corresponding feature pyramids. The encoder part and the decoder part can be laterally connected by concatenation and/or convolution operations. With this structure, both low-level and high-level semantic information are fully utilized, improving the overall prediction performance of the neural network. Furthermore, the total loss function built from the losses of different network layers is used to optimize the neural network, effectively improving the accuracy with which the neural network extracts image features.

[0125] Corresponding to the foregoing method embodiments, this application also provides a road sign recognition device based on multi-network layer loss fusion, an electronic device, and corresponding embodiments.

[0126] Figure 8 is a schematic structural diagram of a road sign recognition device based on multi-network layer loss fusion according to an embodiment of this application.

[0127] Referring to Figure 8, the road sign recognition device based on multi-network layer loss fusion of this application includes an acquisition module 410 and a prediction module 420.

[0128] The acquisition module 410 is used to acquire the image to be recognized.

[0129] The prediction module 420 is used to predict the location region corresponding to the road sign in the image to be recognized based on a preset prediction model.

[0130] The prediction module 420 includes a construction module 421, a fusion module 422, a loss module 423, and an adjustment module 424.

[0131] The construction module 421 is used to construct a neural network with N network layers, where N is a positive integer greater than or equal to 4.

[0132] In some embodiments, each layer of the neural network constructed by the construction module 421 has an encoder portion for acquiring an input feature map and a decoder portion for acquiring an output feature map corresponding to the input feature map. The encoder portions of layers 1 to N perform downsampling operations from top to bottom to acquire the input feature map. The decoder portions of layers 1 to N perform upsampling operations from bottom to top to acquire the output feature map.
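The two feature pyramids can be illustrated by listing the spatial size of the feature map at each layer. The sketch below rests on assumptions that this passage does not state: a sampling rate of 2, a square input, and layer 1 operating at the full input resolution. The function name `pyramid_sizes` is hypothetical.

```python
def pyramid_sizes(input_size, num_layers, rate=2):
    """Return (encoder_sizes, decoder_sizes) for an N-layer encoder-decoder.

    encoder_sizes lists layers 1..N in top-to-bottom (downsampling) order.
    decoder_sizes lists layers N..1 in bottom-to-top (upsampling) order.
    """
    if num_layers < 4:
        raise ValueError("N must be a positive integer greater than or equal to 4")
    encoder = [input_size // rate ** k for k in range(num_layers)]
    decoder = list(reversed(encoder))
    return encoder, decoder


# Hypothetical 256 x 256 input with N = 4 layers.
enc_sizes, dec_sizes = pyramid_sizes(256, 4)
```

Using the same rate in both directions means the decoder output at each layer has the same spatial size as the encoder input at that layer, which keeps the per-layer feature maps aligned.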

[0133] In some embodiments, feature map fusion can be achieved between the encoder and decoder parts of the neural network constructed by the construction module 421 through concatenation and/or convolution operations.

[0134] The fusion module 422 is used to perform cross-semantic information fusion on the input feature maps and output feature maps of different network layers during the acquisition of the input feature maps and output feature maps of the 2nd to (N-1)th network layers of the neural network constructed by the construction module 421.

[0135] In some embodiments, the fusion module 422 is used to determine that, during the acquisition of the input feature maps and output feature maps of layers 2 to N-1 of the neural network, semantic information fusion is performed between the input feature map of the nth network layer and the output feature map of the (n+1)th network layer, and between the output feature map of the nth network layer and the input feature map of the (n+1)th network layer.

[0136] In some embodiments, the fusion module 422 is used to extract a first low-level feature map corresponding to the input feature map of the nth network layer using a set convolution operation; interpolate the first low-level feature map according to a set interpolation operation to obtain a second low-level feature map matching the size of the high-level feature map; and perform channel fusion of the second low-level feature map with the output feature map of the (n+1)th network layer to obtain an output feature map after semantic information fusion. In other embodiments, the fusion module 422 is further used to extract a first high-level feature map corresponding to the output feature map of the nth network layer using a set convolution operation; interpolate the first high-level feature map according to a set interpolation operation to obtain a second high-level feature map matching the size of the low-level feature map; and perform channel fusion of the second high-level feature map with the input feature map of the (n+1)th network layer to obtain an output feature map after semantic information fusion.

[0137] The loss module 423 is used to obtain multiple loss values corresponding to the 1st to (N-1)th network layers respectively, based on the output feature maps and label image data of the 1st to (N-1)th network layers of the neural network obtained by the construction module 421.

[0138] In some embodiments, the loss module 423 is used to obtain multiple loss values corresponding to the 1st to (N-1)th network layers based on the output feature maps after semantic information fusion of the 1st to (N-1)th network layers of the neural network and the label image data.

[0139] The adjustment module 424 is used to adjust the parameters of the neural network according to the multiple loss values obtained by the loss module 423 until the loss values meet the preset conditions, thereby obtaining a trained neural network.

[0140] In some embodiments, the adjustment module 424 is used to obtain the total loss value corresponding to the neural network based on preset weight coefficients and multiple loss values; perform backpropagation operation based on the total loss value to adjust the parameters of the neural network until the total loss value is less than a preset threshold, thereby obtaining a trained neural network.

[0141] In this embodiment, the apparatus of this application can perform cross-semantic information fusion on the input feature maps and output feature maps of different network layers during the process of determining the input feature maps and output feature maps of the 2nd to N-1th layers of the neural network. This achieves the fusion of semantic features of different levels of input feature maps and output feature maps, and establishes loss functions corresponding to different network layers to optimize the neural network. This effectively improves the accuracy of the neural network in extracting image features and further improves the mapping accuracy of high-precision maps.

[0142] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated further here.

[0143] Figure 9 is a schematic structural diagram of an electronic device according to an embodiment of this application.

[0144] Referring to Figure 9, the electronic device 1000 includes a memory 1010 and a processor 1020.

[0145] The processor 1020 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0146] The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage devices. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The permanent storage device may be a read-write storage device, and may be a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, the permanent storage device is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. The system memory may store some or all of the instructions and data required by the processor during operation. Furthermore, the memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 1010 may include a removable storage device that is readable and/or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a high-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.

[0147] The memory 1010 stores executable code which, when executed by the processor 1020, causes the processor 1020 to perform part or all of the methods described above.

[0148] Furthermore, the method according to this application can also be implemented as a computer program or computer program product, which includes computer program code instructions for performing some or all of the steps in the method described above.

[0149] Alternatively, this application may be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) storing executable code (or computer program or computer instruction code) thereon, which, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform part or all of the steps of the methods described above according to this application.

[0150] The various embodiments of this application have been described above. These descriptions are exemplary rather than exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A road sign recognition method based on multi-network layer loss fusion, characterized by comprising: acquiring an image to be recognized; and predicting, according to a preset prediction model, a location region corresponding to a road sign in the image to be recognized; wherein the method for constructing the prediction model comprises: constructing a neural network with N network layers; performing, in the process of acquiring the input feature maps and output feature maps of layers 2 to N-1 of the neural network, cross-semantic information fusion on the input feature maps and output feature maps of different network layers, which includes: determining that, during the acquisition of the input feature maps and output feature maps of layers 2 to N-1 of the neural network, semantic information fusion is performed between the input feature map of layer n and the output feature map of layer n+1, and semantic information fusion is performed between the output feature map of layer n and the input feature map of layer n+1, where n is a positive integer greater than or equal to 2 and less than or equal to N-2; obtaining, based on the output feature maps of layers 1 to N-1 of the neural network and label image data, multiple loss values respectively corresponding to layers 1 to N-1; and adjusting the parameters of the neural network according to the multiple loss values until the loss values meet a preset condition, thereby obtaining a trained neural network, which includes: obtaining a total loss value corresponding to the neural network based on preset weight coefficients and the multiple loss values, and performing backpropagation based on the total loss value to adjust the parameters of the neural network until the total loss value is less than a preset threshold, thereby obtaining the trained neural network; where N is a positive integer greater than or equal to 4.

2. The method of claim 1, wherein each network layer of the neural network has an encoder part for acquiring an input feature map and a decoder part for acquiring an output feature map corresponding to the input feature map; the encoder parts of network layers 1 to N perform downsampling operations from top to bottom to obtain the input feature maps; and the decoder parts of network layers 1 to N perform upsampling operations from bottom to top to obtain the output feature maps.

3. The method of claim 2, wherein the downsampling and upsampling operations use the same sampling rate.

4. The method of claim 1, wherein performing semantic information fusion between the input feature map of layer n and the output feature map of layer n+1 comprises: extracting, using a set convolution operation, a first bottom-level feature map corresponding to the input feature map of layer n; interpolating the first bottom-level feature map according to a set interpolation operation to obtain a second bottom-level feature map matching the size of the high-level feature map; and fusing the second bottom-level feature map with the output feature map of layer n+1 to obtain the output feature map after semantic information fusion.

5. The method of claim 1, wherein performing semantic information fusion between the output feature map of layer n and the input feature map of layer n+1 comprises: extracting, using a set convolution operation, a first high-level feature map corresponding to the output feature map of layer n; interpolating the first high-level feature map according to a set interpolation operation to obtain a second high-level feature map matching the size of the bottom-level feature map; and fusing the second high-level feature map with the input feature map of layer n+1 to obtain the output feature map after semantic information fusion.

6. A road sign recognition device based on multi-network layer loss fusion, characterized by comprising: an acquisition module, used to acquire an image to be recognized; and a prediction module, used to predict, based on a preset prediction model, a location region corresponding to a road sign in the image to be recognized; wherein the prediction module includes: a construction module, used to construct a neural network with N network layers, where N is a positive integer greater than or equal to 4; a fusion module, used to perform cross-semantic information fusion on the input feature maps and output feature maps of different network layers during the acquisition of the input feature maps and output feature maps of layers 2 to N-1 of the neural network, which includes: determining that, during the acquisition of the input feature maps and output feature maps of layers 2 to N-1 of the neural network, semantic information fusion is performed between the input feature map of layer n and the output feature map of layer n+1, and semantic information fusion is performed between the output feature map of layer n and the input feature map of layer n+1, where n is a positive integer greater than or equal to 2 and less than or equal to N-2; a loss module, used to obtain multiple loss values respectively corresponding to layers 1 to N-1 based on the output feature maps and label image data of layers 1 to N-1 of the neural network obtained by the construction module; and an adjustment module, used to adjust the parameters of the neural network according to the multiple loss values obtained by the loss module until the loss values meet a preset condition, thereby obtaining a trained neural network, wherein the adjustment includes: obtaining a total loss value corresponding to the neural network according to preset weight coefficients and the multiple loss values, and performing a backpropagation operation based on the total loss value to adjust the parameters of the neural network until the total loss value is less than a preset threshold, thereby obtaining the trained neural network.

7. An electronic device, comprising: a processor; and a memory having executable code stored thereon, which, when executed by the processor, causes the processor to perform the method according to any one of claims 1-5.

8. A computer-readable storage medium having executable code stored thereon, characterized in that, when the executable code is executed by a processor of an electronic device, the processor performs the method according to any one of claims 1-5.