Digital-twin-oriented 5G smart factory small target detection method and device
By fusing features from image and radar data and utilizing attention mapping and coding techniques for obstacle detection, the problem of handling robots being unable to promptly identify obstacles in harsh environments has been solved, thus improving the accuracy and safety of navigation and obstacle avoidance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2023-03-23
- Publication Date
- 2026-06-12
AI Technical Summary
In harsh environments, transport robots are unable to identify obstacles in a timely and effective manner, leading to operational inconvenience.
By receiving image data and radar data sent by the handling robot, image features are determined using channel attention maps and spatial attention maps, and feature fusion is performed in conjunction with radar data encoding to perform target detection.
It improves the accuracy of navigation and obstacle avoidance for handling robots, ensures the safety of obstacle avoidance and handling, and provides favorable operational support.
Figure CN116524236B_ABST
Description
Technical Field
[0001] This invention relates to the field of smart factory technology, and in particular to a method and apparatus for detecting small targets in a 5G smart factory based on digital twins.
Background Technology
[0002] Building upon the foundation of the digital factory, the smart factory leverages the Internet of Things (IoT) and monitoring technologies to enhance the management and service of production information, thereby improving the controllability of the production process, reducing human intervention, and facilitating collaborative workflows. The smart factory integrates intelligent systems and methods, making it crucial for constructing a new generation of efficient, energy-saving, green, and environmentally friendly factories.
[0003] As artificial intelligence (AI) technology continues to evolve, a growing number of AI techniques are being applied in smart factories, and factory management is becoming increasingly intelligent. In closed environments, the traffic participants are relatively fixed, which makes it practical to deploy obstacle avoidance on transport robots.
[0004] When the handling environment is harsh, the handling robot cannot by itself assess obstacle information in a timely and effective manner, which hinders its operation.
Summary of the Invention
[0005] To address the problems existing in the prior art, this invention provides a method and apparatus for detecting small targets in a 5G smart factory based on digital twins.
[0006] This invention provides a method for small target detection in a 5G smart factory based on digital twins, comprising:
[0007] The system receives image data and radar data transmitted by the transport robot; the image data is the image data in front of the transport robot; the radar data is the radar data of obstacles around the transport robot.
[0008] Based on the channel attention map and spatial attention map of the image data, image features are determined;
[0009] The radar data is encoded to determine radar features;
[0010] The image features and the radar features are fused to determine the target features;
[0011] Target detection is performed based on the aforementioned target features.
[0012] In some embodiments, before determining image features based on the channel attention map and spatial attention map of the image data, the method further includes:
[0013] The image data is downsampled to determine multiple first feature maps;
[0014] The first feature map is fused based on dilated convolution and attention mechanisms to determine the second feature map;
[0015] Non-local operations are performed on the second feature map to determine the third feature map;
[0016] The target feature map is determined based on the second feature map and the third feature map;
[0017] Based on the target feature map, the channel attention map and the spatial attention map are determined.
[0018] In some embodiments, determining the channel attention map based on the target feature map includes:
[0019] Based on pooling operations, the spatial information of the target feature map is aggregated to determine the first average pooling feature and the first max pooling feature of the target feature map.
[0020] The channel attention map is determined based on the multilayer perceptron function, the first average pooling feature, and the first max pooling feature.
[0021] In some embodiments, determining the spatial attention map based on the target feature map includes:
[0022] Based on pooling operations, the channel information of the target feature map is aggregated to determine the second average pooling feature and the second max pooling feature of the target feature map.
[0023] The spatial attention map is determined by convolving the second average pooling feature and the second max pooling feature.
[0024] In some embodiments, the formula for determining image features based on the channel attention map and spatial attention map of the image data is as follows:
[0025]
[0026] Here, $F_{image}$ denotes the image features, S denotes the channel attention map, and $M_s$ denotes the spatial attention map; the formula also uses a weight matrix whose symbol is not preserved in this text.
[0027] In some embodiments, the formula for fusing the image features and the radar features to determine the target features is as follows:
[0028] $F_{fusion} = \sigma\big(W_1\,\tau(W_0(\mathrm{MaxPool}(F_{image}))) + W_1\,\tau(W_0(\mathrm{MaxPool}(F_{radar})))\big)$
[0029] Here, $F_{fusion}$ denotes the target features, $\sigma$ denotes the sigmoid function, $\tau$ denotes the ReLU function, $W_1$ and $W_0$ denote the weight parameters, $F_{image}$ denotes the image features, $F_{radar}$ denotes the radar features, and MaxPool denotes max pooling.
[0030] This invention also provides a small target detection device for 5G smart factories based on digital twins, comprising:
[0031] The receiving module is used to receive image data and radar data sent by the transport robot; the image data is image data of the front of the transport robot; the radar data is radar data of obstacles around the transport robot.
[0032] The first determining module is used to determine image features based on the channel attention map and spatial attention map of the image data;
[0033] The second determining module is used to encode the radar data and determine radar features;
[0034] The fusion module is used to fuse the image features and the radar features to determine the target features;
[0035] The detection module is used to perform target detection based on the target features.
[0036] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the program to implement the small target detection method for 5G smart factories oriented towards digital twins as described above.
[0037] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the small target detection method for 5G smart factories oriented towards digital twins as described above.
[0038] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the small target detection method for 5G smart factories oriented towards digital twins as described above.
[0039] The present invention provides a method, device, electronic device and storage medium for small target detection in 5G smart factories based on digital twins. It determines image features based on the channel attention map and spatial attention map of image data, encodes radar data to determine radar features, and effectively fuses image features and radar features to perform obstacle detection. This improves the accuracy of obstacle avoidance navigation of the handling robot, ensures the safety of the robot in obstacle avoidance and handling, and provides a favorable guarantee for the safe and stable operation of the handling robot.
Description of the Drawings
[0040] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0041] Figure 1 is a flowchart of the small target detection method for 5G smart factories based on digital twins provided in an embodiment of the present invention;
[0042] Figure 2 is a schematic diagram of the structure of the 5G smart factory small target detection system for digital twins provided in an embodiment of the present invention;
[0043] Figure 3 is a schematic diagram of the structure of the handling robot module provided in an embodiment of the present invention;
[0044] Figure 4 is a schematic diagram of the structure of the digital twin platform provided in an embodiment of the present invention;
[0045] Figure 5 is a flowchart of the 5G smart factory small target detection system for digital twins provided in an embodiment of the present invention;
[0046] Figure 6 is a schematic diagram of the network structure for fusing image data and radar data provided in an embodiment of the present invention;
[0047] Figure 7 is a schematic diagram of the structure of the 5G smart factory small target detection device for digital twins provided in an embodiment of the present invention;
[0048] Figure 8 is a schematic diagram of the structure of the electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0049] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0050] The terms "first," "second," etc., used in the specification and claims of this invention are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate so that embodiments of the invention can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first" and "second" are generally of the same class, not limited in number; for example, the first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0051] A digital twin, also known as a digital mirror, is a simulation that makes full use of physical models, sensor updates, and operational history data. It integrates multidisciplinary, multi-scale, and multi-probability processes to create a virtual representation that reflects the entire lifecycle of the corresponding physical equipment.
[0052] With the innovation of technologies such as the Internet of Things (IoT), big data, and mobile internet, the industrial revolution is gradually being put on the agenda, and industrial transformation is entering a substantial stage. Smart factories have become an important practical model for the intelligent development of industry. Building upon digital factories, smart factories utilize IoT and monitoring technologies to strengthen the management and service of factory production information, improve the controllability of the production process, reduce human intervention, and facilitate collaborative scheduling. Smart factories also integrate intelligent systems and methods, making them crucial for building a new generation of efficient, energy-saving, green, and environmentally friendly factories.
[0053] As artificial intelligence (AI) technology continues to evolve, a growing number of AI techniques are being applied in smart factories, and factory management is becoming increasingly intelligent. In closed environments, the traffic participants are relatively fixed, which makes it practical to deploy obstacle avoidance on transport robots. More and more transport robots use perception sensors to assist decision-making, and millimeter-wave radar has become an indispensable sensor for obstacle detection. The radar data it provides offers rich geometric information and precise distance measurements, which helps in locating targets in transport scenarios. However, radar data is sparse, unordered, and unevenly distributed, which limits its use. Camera images, by contrast, consist of regular, dense pixels and carry rich semantic information such as color, but lack depth and position information. Because the two sensors are complementary, fusing data from both modalities enables transport robots to make decisions and be controlled through a digital twin platform. Millimeter-wave radar provides precise information about obstacles around the transport robot and, combined with image information, presents the targets around the robot more intuitively, which plays an important role in the robot's emergency obstacle avoidance and braking.
[0054] Figure 1 is a flowchart of the small target detection method for 5G smart factories based on digital twins provided in this embodiment of the invention. As shown in Figure 1, the method includes:
[0055] Step 101: Receive image data and radar data sent by the transport robot; the image data is the image data in front of the transport robot; the radar data is the radar data of obstacles around the transport robot.
[0056] Step 102: Determine image features based on the channel attention map and spatial attention map of the image data;
[0057] Step 103: Encode the radar data to determine radar features;
[0058] Step 104: Fuse the image features and the radar features to determine the target features;
[0059] Step 105: Target detection is performed based on the target features.
[0060] It should be noted that the method provided by this invention can be executed by an electronic device, a component in an electronic device, an integrated circuit, or a chip. The electronic device can be a mobile electronic device or a non-mobile electronic device. For example, mobile electronic devices can be mobile phones, tablets, laptops, handheld computers, in-vehicle electronic devices, wearable devices, ultra-mobile personal computers (UMPCs), netbooks, or personal digital assistants (PDAs), etc., while non-mobile electronic devices can be servers, network-attached storage (NAS), personal computers (PCs), televisions (TVs), ATMs, or self-service machines, etc. This invention does not impose specific limitations.
[0061] In step 101, image data and radar data sent by the handling robot are received.
[0062] Millimeter-wave radar can be positioned at the top corner and body of the transport robot to collect radar data on obstacles around the robot. High-definition cameras can be positioned in front of or above the robot to collect image data from its front.
[0063] In step 102, image features are determined based on the channel attention map and spatial attention map of the image data.
[0064] Optionally, before determining image features based on the channel attention map and spatial attention map of the image data, the method further includes:
[0065] The image data is downsampled to determine multiple first feature maps;
[0066] The first feature map is fused based on dilated convolution and attention mechanisms to determine the second feature map;
[0067] Non-local operations are performed on the second feature map to determine the third feature map;
[0068] The target feature map is determined based on the second feature map and the third feature map;
[0069] Based on the target feature map, the channel attention map and the spatial attention map are determined.
[0070] 2D features can be extracted from the collected image data using a feature pyramid network. The backbone network applies downsampling and convolution from bottom to top and outputs the features {C2,C3,C4,C5}, which are the first feature maps.
[0071] A 1×1 convolution is applied to C5 to obtain P5. P5 is upsampled by a factor of 2 and added to C4, after C4 has passed through a 1×1 convolution, to obtain P4. P4 is upsampled by a factor of 2 and added to the 1×1-convolved C3 to obtain P3. P3 is upsampled by a factor of 2 and added to the 1×1-convolved C2 to obtain P2.
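The top-down pathway described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the channel counts, the spatial sizes, the nearest-neighbour upsampling, and the helper names `conv1x1`, `upsample2x`, and `fpn_top_down` are all assumptions chosen for the example.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: mixes channels at each pixel. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 (assumed; the text only says 'upsampled by 2')."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(c_feats, lateral_weights):
    """Build [P2, P3, P4, P5] from [C2, C3, C4, C5]."""
    # P5 comes directly from C5 through a 1x1 convolution.
    p = conv1x1(c_feats[-1], lateral_weights[-1])
    pyramid = [p]
    # Walk down the pyramid: upsample the coarser P level, add the 1x1-projected C level.
    for c, w in zip(reversed(c_feats[:-1]), reversed(lateral_weights[:-1])):
        p = upsample2x(p) + conv1x1(c, w)
        pyramid.append(p)
    return pyramid[::-1]

rng = np.random.default_rng(0)
channels = [8, 16, 32, 64]   # hypothetical channel counts for C2..C5
sizes = [32, 16, 8, 4]       # each level halves the spatial resolution
out_ch = 8
c_feats = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
weights = [rng.standard_normal((out_ch, c)) * 0.1 for c in channels]

p2, p3, p4, p5 = fpn_top_down(c_feats, weights)
print([p.shape for p in (p2, p3, p4, p5)])
```

The sketch omits the dilated convolution and attention steps that the text adds in the P4, P3, and P2 fusion stages.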
[0072] In the P4 fusion stage, dilated convolution is added to increase the detector's receptive field and reduce the impact of downsampling. The calculation formula is as follows:
[0073]
[0074] Here, $N_{in}$ denotes the input feature, $N_{out}$ denotes the output feature, p is the stride, and f is the padding factor.
[0075] That is, in the P4 fusion stage, the input features are P5 and C4, and after adding dilated convolution, the output feature is P4.
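The formula for this step is not reproduced in the text, so the following sketch uses the standard output-size relation for a dilated convolution rather than the patent's own expression. The function names and the parameters `kernel`, `stride`, `padding`, and `dilation` are assumptions for illustration and do not map onto the symbols p and f above.

```python
def effective_kernel_size(kernel, dilation):
    """A dilated kernel covers dilation * (kernel - 1) + 1 input positions."""
    return dilation * (kernel - 1) + 1

def conv_output_size(n_in, kernel, stride=1, padding=0, dilation=1):
    """Standard output length of a (dilated) convolution along one axis."""
    return (n_in + 2 * padding - effective_kernel_size(kernel, dilation)) // stride + 1

# A 3x3 kernel with dilation 2 sees a 5x5 window but keeps 9 weights.
print(effective_kernel_size(3, 2))
# With matching padding, the feature map keeps its size while the receptive field grows.
print(conv_output_size(16, kernel=3, padding=1, dilation=1))
print(conv_output_size(16, kernel=3, padding=2, dilation=2))
```

This illustrates why dilation enlarges the receptive field without further downsampling: the window widens while the output resolution is unchanged.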
[0076] Attention mechanisms are incorporated into the P2 and P3 fusion stages to preserve as much of the lower-layer image features as possible and to extract key feature points. The specific calculation process is as follows:
[0077]
[0078] Here, X and Y denote the input and output features, which have the same dimensions (H, W, and C denote the height, width, and number of channels of the feature map, respectively), and j and i denote the position indices of the input and output features, respectively.
[0079] The function $f(X_i, X_j)$ computes the similarity between $X_i$ and $X_j$, the function $g(X_j)$ computes the feature representation at position j, and $Q(X)$ is the normalization parameter. A Gaussian function is used as the similarity function. The specific calculation process is as follows:
[0080]
[0081] $\theta(X_i) = W_\theta X_i$,
[0082] Here, $W_\theta$ and a second weight matrix, whose symbol is not preserved in this text, denote the weight matrices.
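The two formulas in this passage did not survive in the text. The definitions given for f, g, Q, and θ match the commonly used non-local block with an embedded Gaussian similarity, so under that assumption (not confirmed by the source) they might read as follows. The names $\phi$ and $W_\phi$ for the second embedding and its weight matrix are hypothetical.

```latex
Y_i = \frac{1}{Q(X)} \sum_{j} f(X_i, X_j)\, g(X_j),
\qquad
f(X_i, X_j) = e^{\theta(X_i)^{\top} \phi(X_j)},
\qquad
\theta(X_i) = W_\theta X_i, \quad \phi(X_j) = W_\phi X_j
```

With this choice, taking $Q(X) = \sum_{j} f(X_i, X_j)$ turns the weighting into a softmax over the positions j.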
[0083] In the P3 fusion stage, the input features are P4 and C3, and after adding the attention mechanism, the output feature is P3.
[0084] In the P2 fusion stage, the input features are P3 and C2. After adding the attention mechanism, the output feature is P2.
[0085] Using the above method, the first feature map {C2,C3,C4,C5} is fused based on dilated convolution and spatial attention to obtain the second feature map {P2,P3,P4,P5}.
[0086] Non-local operations are performed on the second feature map to obtain the third feature map $W_Z Y$. The second and third feature maps are then superimposed through a residual connection to obtain the target feature map, as shown in the following expression:
[0087] $Z = W_Z Y + X$
[0088] Here, Z denotes the target feature map, $W_Z$ denotes the weight matrix, Y denotes the output features, and X denotes the input features.
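As a rough sketch of the non-local operation and the residual expression Z = W_Z Y + X, the NumPy code below flattens the feature map to N = H·W positions and uses a softmax-normalised dot-product similarity. That similarity, the embedding matrices `w_theta`, `w_phi`, and `w_g`, and all sizes are assumptions for illustration; only the residual form is taken from the text.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """x: (N, C) features at N positions. Returns Z = W_Z Y + X with shape (N, C)."""
    theta = x @ w_theta.T                      # (N, C') query-like embedding
    phi = x @ w_phi.T                          # (N, C') key-like embedding
    g = x @ w_g.T                              # (N, C') feature representation g(X_j)
    # Exponentiated similarity normalised over j equals a row-wise softmax.
    attn = softmax(theta @ phi.T, axis=1)      # (N, N)
    y = attn @ g                               # (N, C') output features Y
    return y @ w_z.T + x                       # residual connection

rng = np.random.default_rng(0)
n_pos, c, c_inner = 16, 8, 4                   # hypothetical sizes (a 4x4 map with 8 channels)
x = rng.standard_normal((n_pos, c))
w_theta, w_phi, w_g = (rng.standard_normal((c_inner, c)) * 0.1 for _ in range(3))
w_z = rng.standard_normal((c, c_inner)) * 0.1
z = non_local_block(x, w_theta, w_phi, w_g, w_z)
print(z.shape)
```

If `w_z` is all zeros the block reduces to the identity, which is one reason the residual form is convenient: the non-local branch can be added to an existing network without disturbing it at initialisation.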
[0089] In some embodiments, determining the channel attention map based on the target feature map includes:
[0090] Based on pooling operations, the spatial information of the target feature map is aggregated to determine the first average pooling feature and the first max pooling feature of the target feature map.
[0091] The channel attention map is determined based on the multilayer perceptron function, the first average pooling feature, and the first max pooling feature.
[0092] To capture the dependencies between feature maps with channel attention and to enhance image features in both context and channels, average pooling and max pooling are used together, which significantly improves the network's representational power and generates two distinct spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$, denoting the average pooling feature and the max pooling feature, respectively.
[0093] The two descriptors are then forwarded to a shared network to generate the channel attention map $M_c \in \mathbb{R}^{C\times 1\times 1}$.
[0094] The shared network consists of a multilayer perceptron (MLP). To reduce parameter overhead, the activation size of the hidden layer is set to $\mathbb{R}^{C/r\times 1\times 1}$. Channel attention is computed as follows:
[0095] $M_c(Z) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(Z)) + \mathrm{MLP}(\mathrm{MaxPool}(Z)))$
[0096] Here, $M_c(Z)$ denotes channel attention, $\sigma$ denotes the sigmoid function, MLP denotes the multilayer perceptron, Z denotes the target feature map, AvgPool denotes average pooling, and MaxPool denotes max pooling.
[0097] Applying $M_c$ to the obtained target feature map Z yields the channel attention map S, computed as follows:
[0098] $S = Z * M_c(Z)$
[0099] Here, S denotes the channel attention map, Z denotes the target feature map, and $M_c(Z)$ denotes channel attention.
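The two expressions above can be illustrated with a short NumPy sketch. It assumes a two-layer shared MLP with a ReLU hidden layer and reads the product in S = Z * M_c(Z) as element-wise multiplication broadcast over the spatial axes. The weight names `w0` and `w1` and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def shared_mlp(v, w0, w1):
    """Shared two-layer perceptron; the ReLU hidden activation is an assumption."""
    return w1 @ np.maximum(w0 @ v, 0.0)

def channel_attention(z, w0, w1):
    """z: (C, H, W). Returns M_c(Z) with shape (C, 1, 1)."""
    avg = z.mean(axis=(1, 2))    # first average pooling feature, shape (C,)
    mx = z.max(axis=(1, 2))      # first max pooling feature, shape (C,)
    m = sigmoid(shared_mlp(avg, w0, w1) + shared_mlp(mx, w0, w1))
    return m[:, None, None]

rng = np.random.default_rng(0)
c, h, w, r = 16, 8, 8, 4                      # r is the hidden-layer reduction (C / r units)
z = rng.standard_normal((c, h, w))
w0 = rng.standard_normal((c // r, c)) * 0.1   # hidden layer: C -> C / r
w1 = rng.standard_normal((c, c // r)) * 0.1   # output layer: C / r -> C

m_c = channel_attention(z, w0, w1)
s = z * m_c                                   # S = Z * M_c(Z), broadcast over H and W
print(m_c.shape, s.shape)
```

Because both pooled descriptors pass through the same weights, the parameter count is that of a single MLP regardless of how many pooling branches are used.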
[0100] In some embodiments, determining the spatial attention map based on the target feature map includes:
[0101] Based on pooling operations, the channel information of the target feature map is aggregated to determine the second average pooling feature and the second max pooling feature of the target feature map.
[0102] The spatial attention map is determined by convolving the second average pooling feature and the second max pooling feature.
[0103] Spatial attention maps can be generated using the spatial relationships of features. Unlike channel attention, spatial attention focuses on "where" as an information component, thus complementing channel attention.
[0104] To compute spatial attention, average pooling and max pooling operations are first applied along the channel axis and then concatenated to generate effective feature descriptors. Applying pooling operations along the channel axis has been shown to effectively highlight informative regions.
[0105] A convolutional layer is applied to the concatenated feature descriptor to generate the spatial attention map $M_s(X) \in \mathbb{R}^{H\times W\times C}$, which encodes the locations to be emphasized or suppressed.
[0106] Two pooling operations aggregate the channel information of the feature map into two 2D feature maps, $F^s_{avg} \in \mathbb{R}^{1\times H\times W}$ and $F^s_{max} \in \mathbb{R}^{1\times H\times W}$, which denote the average pooling feature and the max pooling feature across the channels, respectively.
[0107] They are then concatenated and convolved by a standard convolutional layer to generate the 2D spatial attention map, as shown in the following expression:
[0108] $M_s(X) = \sigma(f^{7\times 7}([\mathrm{AvgPool}(X); \mathrm{MaxPool}(X)]))$
[0109] Here, $M_s(X)$ denotes the spatial attention map, $\sigma$ denotes the sigmoid function, $f^{7\times 7}$ denotes a convolution operation with a filter size of 7×7, AvgPool denotes average pooling, and MaxPool denotes max pooling.
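A minimal NumPy sketch of the spatial attention expression follows. The zero padding that keeps the output at H×W, the loop-based convolution helper `conv2d_same`, and the random kernel are assumptions for illustration. The sketch returns a single-channel map of shape (1, H, W), which can be broadcast across channels when applied to a feature map.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv2d_same(x, kernel):
    """Cross-correlation, as in deep learning frameworks, with zero padding.
    x: (C_in, H, W), kernel: (C_in, k, k) -> output (H, W)."""
    _, k, _ = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) over all input channels.
            out += np.einsum("c,chw->hw", kernel[:, i, j], xp[:, i:i + h, j:j + w])
    return out

def spatial_attention(x, kernel):
    """x: (C, H, W). Returns M_s(X) with shape (1, H, W)."""
    avg = x.mean(axis=0, keepdims=True)         # second average pooling feature, (1, H, W)
    mx = x.max(axis=0, keepdims=True)           # second max pooling feature, (1, H, W)
    desc = np.concatenate([avg, mx], axis=0)    # concatenated descriptor, (2, H, W)
    return sigmoid(conv2d_same(desc, kernel))[None]

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
kernel = rng.standard_normal((2, 7, 7)) * 0.05  # one 7x7 filter over the 2 pooled maps
m_s = spatial_attention(x, kernel)
print(m_s.shape)
```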
[0110] In some embodiments, the formula for determining image features based on the channel attention map and spatial attention map of the image data is as follows:
[0111]
[0112] Here, $F_{image}$ denotes the image features, S denotes the channel attention map, and $M_s$ denotes the spatial attention map; the formula also uses a weight matrix whose symbol is not preserved in this text.
[0113] Optionally, the channel attention map S and the spatial attention map $M_s$ are superimposed pixel by pixel; the calculation formula is as follows:
[0114]
[0115] Here, $F_{image}$ denotes the image features, S denotes the channel attention map, and $M_s$ denotes the spatial attention map; the formula also uses a weight matrix whose symbol is not preserved in this text.
[0116] In step 103, the radar data is encoded to determine radar features.
[0117] To reduce radar data instability, the acquired raw radar data can be time- and location-encoded to obtain a time-coded radar data map. The k-th frame radar data map is calculated as follows:
[0118]
[0119] In this formula, the result denotes the radar data map of the k-th frame at time t, F denotes the original radar data, n denotes the total number of radar frames, k denotes the frame index with k∈[0,n+1], and t denotes time.
[0120] The encoded radar data map is processed using a multilayer perceptron algorithm and fusion parameters to obtain the processed radar feature map, as shown in the following expression:
[0121]
[0122] In this formula, the result denotes the processed radar feature map of the k-th frame at time t, the input is the radar data map of the k-th frame at time t, c denotes the fusion parameters, and MLP denotes the multilayer perceptron.
[0123] The processed radar feature maps are concatenated to obtain the following radar features:
[0124]
[0125] Here, $F_{radar}$ denotes the overall radar features after processing, and cat denotes the concatenation operation applied to the processed radar feature maps of the k-th frame at time t.
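The formulas for the radar branch are not reproduced in the text, so the sketch below only illustrates the general shape of the last two steps: a shared multilayer perceptron applied to each frame, followed by concatenation. The time and location encoding and the fusion parameters c are not modelled. The two-layer MLP, the concatenation axis, and every name and size here are assumptions for illustration.

```python
import numpy as np

def mlp(points, w0, b0, w1, b1):
    """Two-layer perceptron applied to every radar point independently (assumed form)."""
    hidden = np.maximum(points @ w0 + b0, 0.0)
    return hidden @ w1 + b1

def encode_radar_frames(frames, params):
    """frames: list of (N_k, 5) arrays, one per frame. Returns (sum of N_k, D) features."""
    per_frame = [mlp(f, *params) for f in frames]   # processed feature map of each frame
    return np.concatenate(per_frame, axis=0)        # 'cat' over frames (axis is an assumption)

rng = np.random.default_rng(0)
hidden_dim, out_dim = 16, 8                         # hypothetical layer sizes
params = (
    rng.standard_normal((5, hidden_dim)) * 0.1, np.zeros(hidden_dim),
    rng.standard_normal((hidden_dim, out_dim)) * 0.1, np.zeros(out_dim),
)
frames = [rng.standard_normal((10, 5)), rng.standard_normal((7, 5))]
f_radar = encode_radar_frames(frames, params)
print(f_radar.shape)
```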
[0126] In step 104, the image features and the radar features are fused to determine the target features.
[0127] By designing a feature fusion network, data from two sensors are processed separately to enable the interaction of feature maps of perceptual information obtained from two different sensors.
[0128] Optionally, the formula for fusing the image features and the radar features to determine the target features is as follows:
[0129] $F_{fusion} = \sigma\big(W_1\,\tau(W_0(\mathrm{MaxPool}(F_{image}))) + W_1\,\tau(W_0(\mathrm{MaxPool}(F_{radar})))\big)$
[0130] Here, $F_{fusion}$ denotes the target features after fusion, $\sigma$ denotes the sigmoid function, $\tau$ denotes the ReLU function, and $W_1$ and $W_0$ denote the weight parameters.
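The fusion expression can be sketched in NumPy as below. The sketch assumes that both feature maps share the shape (C, H, W), that MaxPool is a global max over the spatial axes, and that W0 and W1 are dense matrices shared by the two branches. These readings, and all sizes, are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fuse(f_image, f_radar, w0, w1):
    """f_image, f_radar: (C, H, W). Returns F_fusion with shape (C,)."""
    def branch(f):
        pooled = f.max(axis=(1, 2))                 # global MaxPool over H and W (assumed)
        return w1 @ np.maximum(w0 @ pooled, 0.0)    # W1 tau(W0 .) with tau = ReLU
    # sigma is applied to the sum of the image branch and the radar branch.
    return sigmoid(branch(f_image) + branch(f_radar))

rng = np.random.default_rng(0)
c, h, w = 16, 8, 8
f_image = rng.standard_normal((c, h, w))
f_radar = rng.standard_normal((c, h, w))
w0 = rng.standard_normal((c // 4, c)) * 0.1
w1 = rng.standard_normal((c, c // 4)) * 0.1

f_fusion = fuse(f_image, f_radar, w0, w1)
print(f_fusion.shape)
```

Because the two branches share weights and are summed before the sigmoid, the result is symmetric in the two inputs.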
[0131] In step 105, target detection is performed based on the target features.
[0132] Based on the fused target features, accurate images and location information of obstacle targets can be obtained, enabling obstacle detection and improving the accuracy of navigation and obstacle avoidance for handling robots, thus providing a favorable guarantee for the safe and stable operation of handling robots.
[0133] The small target detection method for 5G smart factories based on digital twins provided in this invention determines image features based on the channel attention map and spatial attention map of image data, encodes radar data to determine radar features, and effectively fuses image features and radar features to perform obstacle detection. This improves the accuracy of obstacle avoidance navigation for handling robots, ensures the safety of obstacle avoidance and handling by robots, and provides favorable protection for the safe and stable operation of handling robots.
[0134] Figure 2 is a schematic diagram of the structure of the 5G smart factory small target detection system for digital twins provided in an embodiment of the present invention. As shown in Figure 2, the system includes: a handling robot module, a digital twin platform, and a control module.
[0135] Figure 3 is a schematic diagram of the structure of the handling robot module provided in an embodiment of the present invention. As shown in Figure 3, the handling robot module includes:
[0136] Bus control unit, used for connection control between handling robots and various types of sensors.
[0137] Millimeter-wave radar equipment is installed at the top corner and body of the vehicle to collect data on obstacles around the transport robot. The radar data is a radar point cloud, which can be represented as a set of points, each of which can be represented as (x, y, z, v, p), where x, y, z represent the XYZ coordinate data of the radar point cloud, v represents the Doppler velocity, and p represents the energy of the point.
[0138] A high-definition camera device is installed in front of or above the robot to collect image data in front of the robot, including 1280×720 RGB images.
[0139] The first communication unit is used to send sensor information to the digital twin platform.
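The sensor data formats listed above, radar points of the form (x, y, z, v, p) and 1280×720 RGB images, could be held in NumPy as follows. The field types, the sample values, and the 0.1 velocity threshold are hypothetical and only illustrate the layout.

```python
import numpy as np

# One radar point: XYZ coordinates, Doppler velocity v, and point energy p.
radar_point = np.dtype([("x", "f4"), ("y", "f4"), ("z", "f4"), ("v", "f4"), ("p", "f4")])

cloud = np.array(
    [(1.0, 0.5, 0.2, -0.3, 12.0),    # made-up point with non-zero Doppler velocity
     (4.2, -1.1, 0.0, 0.0, 3.5)],    # made-up static point
    dtype=radar_point,
)

# Distance of each point from the sensor origin.
ranges = np.sqrt(cloud["x"] ** 2 + cloud["y"] ** 2 + cloud["z"] ** 2)
# Points whose Doppler velocity magnitude exceeds a hypothetical threshold.
moving = cloud[np.abs(cloud["v"]) > 0.1]

# A 1280x720 RGB frame stored as height x width x channels.
image = np.zeros((720, 1280, 3), dtype=np.uint8)
print(len(cloud), len(moving), image.shape)
```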
[0140] Optionally, the bus control unit includes a CAN card interface for connecting and communicating with sensor devices, and sending the collected sensor data to the first communication unit.
[0141] Optionally, the bus control unit further includes: multiple operation subunits connected to the 5G smart factory small target detection system for digital twins, used for operating the digital twin platform and generating corresponding control signals.
[0142] The digital twin platform is used to create virtual 3D transport robot operation routes through digital twins. It can map the actual transport automation system through virtual transport. By using digital twin models and deep learning technology, it can detect small obstacles in the surrounding area in advance during the operation of the transport robot, thus ensuring the safe operation of the robot.
[0143] Figure 4 is a schematic diagram of the structure of the digital twin platform provided in an embodiment of the present invention. As shown in Figure 4, the digital twin platform also includes a second communication unit, which is used to receive sensor information of the factory handling line sent by the robot and send the information to the control module.
[0144] The control module includes:
[0145] The obstacle detection fusion unit is used to fuse sensor information sent by the digital twin platform and determine whether obstacle avoidance is required.
[0146] The obstacle avoidance control command unit sends obstacle avoidance control command information to the transport robot.
[0147] Figure 5 is a flowchart of the 5G smart factory small target detection system based on digital twins provided in this embodiment of the invention. As shown in Figure 5, the workflow includes:
[0148] Step 1: The transport robot starts working mode, collects 2D perception information and millimeter-wave radar information in the scene, and transmits them synchronously to the remote digital twin platform.
[0149] Step 2: After receiving the image and radar information transmitted by the handling robot, the digital twin platform performs virtual mapping and simulation of the physical handling entities in the smart factory through model-driven and data-driven approaches.
[0150] Step 3: The digital twin platform processes the real-time data and transmits the sensor information to the control module.
[0151] Step 4: After receiving the sensor information, the control module starts 2D image feature information extraction and fuses the image data with millimeter-wave radar data to obtain accurate obstacle target images and location information.
[0152] Image data fusion with millimeter-wave radar data includes the following steps:
[0153] Step 4.1, 2D feature extraction: A feature pyramid network is used for feature extraction. The backbone network applies downsampling and convolution from bottom to top and outputs the features {C2,C3,C4,C5}, which are the first feature maps.
[0154] A 1×1 convolution is applied to C5 to obtain P5. P5 is upsampled by a factor of 2 and added to C4, after C4 has passed through a 1×1 convolution, to obtain P4. P4 is upsampled by a factor of 2 and added to the 1×1-convolved C3 to obtain P3. P3 is upsampled by a factor of 2 and added to the 1×1-convolved C2 to obtain P2.
[0155] In the P4 fusion stage, dilated convolution is added to increase the detector's receptive field and reduce the impact of downsampling. The calculation formula is as follows:
[0156]
[0157] Where $N_{in}$ denotes the input features, $N_{out}$ denotes the output features, p is the stride, and f is the padding factor.
[0158] That is, in the P4 fusion stage, the input features are P5 and C4, and after adding dilated convolution, the output feature is P4.
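As a rough illustration of why dilation enlarges the receptive field without extra downsampling, the sketch below uses the standard output-size relation for a dilated convolution and a one-dimensional toy convolution. The function names and parameter names are assumptions and need not match the notation of this embodiment.

```python
def dilated_output_size(n_in, kernel, stride=1, padding=0, dilation=1):
    """Standard output length of a (dilated) convolution along one axis."""
    effective = dilation * (kernel - 1) + 1           # span covered by the kernel
    return (n_in + 2 * padding - effective) // stride + 1

def dilated_conv1d(x, w, dilation=1):
    """1D dilated convolution, stride 1, no padding."""
    span = dilation * (len(w) - 1)
    return [sum(w[k] * x[i + k * dilation] for k in range(len(w)))
            for i in range(len(x) - span)]

signal = [1, 2, 3, 4, 5, 6, 7]
dense = dilated_conv1d(signal, [1, 1, 1], dilation=1)    # each output sees 3 inputs
dilated = dilated_conv1d(signal, [1, 1, 1], dilation=2)  # each output spans 5 inputs
```

A 3-tap kernel with dilation 2 covers the same span as a 5-tap kernel while keeping three weights, and with padding 2 it preserves the input length.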
[0159] Attention mechanisms were incorporated into the P2 and P3 fusion stages to maximize the preservation of image features in the lower layers and extract key feature points. The specific calculation process is as follows:
[0160]
[0161] Where X and Y represent the input and output features, which have the same dimensions (H, W, and C represent the height, width, and number of channels of the feature map, respectively), and j and i represent the position indices of the input and output features, respectively.
[0162] The function $f(X_i, X_j)$ computes the similarity between $X_i$ and $X_j$, the function $g(X_j)$ computes the feature representation at position j, and Q(x) is the normalization parameter. A Gaussian function is used as the similarity function. The specific calculation process is as follows:
[0163]
[0164] $\theta(X_i) = W_\theta X_i$,
[0165] Where $W_\theta$ and the corresponding matrix of the second embedding are weight matrices.
[0166] In the P3 fusion stage, the input features are P4 and C3, and after adding the attention mechanism, the output feature is P3.
[0167] In the P2 fusion stage, the input features are P3 and C2. After adding the attention mechanism, the output feature is P2.
[0168] Using the above method, the first feature map {C2,C3,C4,C5} is fused based on dilated convolution and spatial attention to obtain the second feature map {P2,P3,P4,P5}.
[0169] Non-local operations are performed on the obtained second feature map to obtain the third feature map $W_Z Y$. The second and third feature maps are superimposed through a residual connection to obtain the target feature map, as shown in the following expression:
[0170] $Z = W_Z Y + X$
[0171] Where Z represents the target feature map, $W_Z$ represents the weight matrix, Y represents the output features, and X represents the input features.
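The non-local operation with the residual term $Z = W_Z Y + X$ can be sketched as follows. This is a minimal sketch under stated assumptions: the feature map is flattened into a list of position vectors, the similarity is taken to be the embedded Gaussian $\exp(\theta(X_i)^T \phi(X_j))$, and the second embedding `w_phi`, the matrix `w_g` for $g$, and all function names are hypothetical.

```python
import math

def matvec(w, v):
    """Multiply matrix w ([rows][cols]) by vector v."""
    return [sum(wr[k] * v[k] for k in range(len(v))) for wr in w]

def non_local(x, w_theta, w_phi, w_g, w_z):
    """x: list of N position vectors of length C. Returns Z = W_Z Y + X."""
    theta = [matvec(w_theta, xi) for xi in x]
    phi = [matvec(w_phi, xj) for xj in x]        # second embedding (assumed)
    g = [matvec(w_g, xj) for xj in x]            # feature representation g(X_j)
    z = []
    for i, xi in enumerate(x):
        scores = [sum(a * b for a, b in zip(theta[i], pj)) for pj in phi]
        m = max(scores)                          # shift for numerical stability
        f = [math.exp(s - m) for s in scores]    # similarity, up to a factor that cancels
        q = sum(f)                               # normalization over all positions j
        y = [sum(f[j] * g[j][c] for j in range(len(x))) / q
             for c in range(len(g[0]))]
        z.append([a + b for a, b in zip(matvec(w_z, y), xi)])  # residual addition
    return z

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [1.0, 0.0]]
z_identity = non_local(x, identity, identity, identity, identity)
z_passthrough = non_local(x, identity, identity, identity, zeros)
```

Setting `w_z` to zero reduces the block to the identity mapping, which is one reason the residual form is convenient when inserting such a block into an existing network.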
[0172] To use channel attention to capture dependencies between feature maps and enhance image features in both context and channels, average pooling and max pooling are used together, which significantly improves the network's representational power. They generate two distinct spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$, which represent the average pooling feature and the max pooling feature, respectively.
[0173] The two descriptors are then forwarded to a shared network to generate the channel attention $M_c \in \mathbb{R}^{C \times 1 \times 1}$.
[0174] The shared network consists of a multilayer perceptron (MLP). To reduce parameter overhead, the activation size of the hidden layer is set to $\mathbb{R}^{C/r \times 1 \times 1}$. The operation process for channel attention is as follows:
[0175] $M_c(Z) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(Z)) + \mathrm{MLP}(\mathrm{MaxPool}(Z)))$
[0176] Where $M_c(Z)$ represents channel attention, σ represents the sigmoid function, MLP represents the multilayer perceptron, Z represents the target feature map, AvgPool represents average pooling, and MaxPool represents max pooling.
[0177] The obtained target feature map Z is processed by the $M_c$ module to yield the channel attention map S. The specific calculation process is as follows:
[0178] $S = Z * M_c(Z)$
[0179] Where S represents the channel attention map, Z represents the target feature map, and $M_c(Z)$ represents channel attention.
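The channel attention computation above can be sketched as follows, with the target feature map Z given as nested lists shaped [C][H][W]. The sketch assumes a two-layer shared MLP with a ReLU hidden activation (the hidden activation is not specified here) and uses hypothetical weights `w0` ([C/r][C]) and `w1` ([C][C/r]).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mlp(v, w0, w1):
    """Shared two-layer perceptron; the ReLU hidden activation is an assumption."""
    hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def channel_attention(z, w0, w1):
    """M_c(Z) = sigmoid(MLP(AvgPool(Z)) + MLP(MaxPool(Z))), one weight per channel."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in z]   # spatial average
    mx = [max(map(max, ch)) for ch in z]                             # spatial maximum
    return [sigmoid(p + q) for p, q in zip(mlp(avg, w0, w1), mlp(mx, w0, w1))]

def apply_channel_attention(z, w0, w1):
    """S = Z * M_c(Z): every channel is scaled by its attention weight."""
    mc = channel_attention(z, w0, w1)
    return [[[v * mc[c] for v in row] for row in ch] for c, ch in enumerate(z)]

z = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 8.0]]]
w0 = [[1.0, 0.0]]          # C = 2 channels reduced to C/r = 1 hidden unit
w1 = [[1.0], [0.0]]
mc = channel_attention(z, w0, w1)
s = apply_channel_attention(z, w0, w1)
```

Because the same MLP processes both pooled descriptors, the parameter count stays at $2C^2/r$ regardless of the spatial size of Z.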
[0180] Step 4.2: Spatial attention maps can be generated using the spatial relationships of features. Unlike channel attention, spatial attention focuses on "where" as an information component, thus complementing channel attention.
[0181] To compute spatial attention, average pooling and max pooling operations are first applied along the channel axis and then concatenated to generate effective feature descriptors. Applying pooling operations along the channel axis has been shown to effectively highlight informative regions.
[0182] A convolutional layer is applied to the concatenated feature descriptor to generate a spatial attention map $M_s(X) \in \mathbb{R}^{H \times W \times C}$, which encodes the locations to be emphasized or suppressed.
[0183] Two 2D feature maps are generated by aggregating the channel information of the feature map using two pooling operations: $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$, which represent the average pooling feature and the max pooling feature across the channels, respectively.
[0184] They are then concatenated and convolved through a standard convolutional layer to generate a 2D feature space attention map, as shown in the following expression:
[0185] $M_s(X) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(X); \mathrm{MaxPool}(X)]))$
[0186] Where $M_s(X)$ represents the spatial attention map, σ represents the sigmoid function, $f^{7 \times 7}$ denotes a convolution operation with a filter size of 7×7, AvgPool represents average pooling, and MaxPool represents max pooling.
[0187] The channel attention map S and the spatial attention map $M_s$ are superimposed pixel by pixel. The calculation formula is as follows:
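The spatial attention expression can be sketched as follows. This is an illustrative sketch only: the input is a nested list shaped [C][H][W], zero padding is assumed so that the output keeps the H×W size, and the kernel layout ([2][k][k], one slice per pooled descriptor) and function name are hypothetical.

```python
import math

def spatial_attention(x, kernel):
    """x: [C][H][W]; kernel: [2][k][k]. Returns M_s as an [H][W] map in (0, 1)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Pool along the channel axis to obtain two H x W descriptors.
    avg = [[sum(x[ch][i][j] for ch in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(x[ch][i][j] for ch in range(c)) for j in range(w)] for i in range(h)]
    desc = [avg, mx]                                  # concatenated descriptor
    k = len(kernel[0])
    pad = k // 2                                      # zero padding (an assumption)
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = 0.0
            for d in range(2):
                for u in range(k):
                    for v in range(k):
                        y, xx = i + u - pad, j + v - pad
                        if 0 <= y < h and 0 <= xx < w:
                            acc += kernel[d][u][v] * desc[d][y][xx]
            row.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid
        out.append(row)
    return out

x = [[[0.0] * 3 for _ in range(3)], [[2.0] * 3 for _ in range(3)]]
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
kernel[0][3][3] = 1.0                                 # pass the average descriptor through
m_s = spatial_attention(x, kernel)
```

In this toy case the channel average is 1.0 everywhere, so every entry of `m_s` equals sigmoid(1.0).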
[0188]
[0189] Where $F_{image}$ represents the image features, S represents the channel attention map, and $M_s$ represents the spatial attention map; the remaining symbol in the formula represents the weight matrix.
[0190] Step 4.3: Millimeter-wave radar data extraction. To reduce radar data instability, the acquired raw radar data can be time- and location-encoded to obtain a time-coded original radar image. The k-th frame radar data image is calculated as follows:
[0191]
[0192] Where the left-hand side of the formula represents the radar data map of the k-th frame at time t, F represents the original radar data, n represents the total number of radar frames, k represents the frame index with k∈[0,n+1], and t represents time.
[0193] Step 4.4: Using a multilayer perceptron algorithm and fusion parameters, the encoded radar data map is processed to obtain the processed radar feature map, expressed as follows:
[0194]
[0195] Where the output represents the processed radar feature map of the k-th frame at time t, the input represents the radar data map of the k-th frame at time t, c represents the fusion parameters, and MLP represents the multilayer perceptron.
[0196] The processed radar feature maps are concatenated to obtain the following radar features:
[0197]
[0198] Where $F_{radar}$ represents the overall radar features after processing, cat denotes the concatenation operation, and its operands are the processed radar feature maps of the k-th frame at time t.
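Steps 4.4 and the concatenation above can be illustrated with the following sketch. It rests on several assumptions that are not stated in this embodiment: each encoded radar frame is a flat vector, the MLP has one ReLU hidden layer, and the fusion parameter c acts as a scalar multiplier on the MLP output. All names and weights are hypothetical.

```python
def mlp(v, w0, w1):
    """Two-layer perceptron with a ReLU hidden layer (an assumption)."""
    hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def radar_features(frames, w0, w1, c=1.0):
    """Process every encoded frame with a shared MLP, then concatenate.

    How the fusion parameter c enters is an assumption: here it scales
    the MLP output of each frame.
    """
    per_frame = [[c * v for v in mlp(frame, w0, w1)] for frame in frames]
    return [v for feat in per_frame for v in feat]    # cat(...) over the frames

frames = [[1.0, 2.0], [3.0, 4.0]]                     # two toy encoded radar frames
f_radar = radar_features(frames, w0=[[1.0, 1.0]], w1=[[1.0], [2.0]], c=0.5)
```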
[0199] Step 4.5: A feature fusion network is designed in which the data from the two sensors are first processed separately, and the feature maps of the perceptual information acquired from the two sensors then interact. Optionally, Figure 6 is a schematic diagram of the network structure for fusing image data and radar data provided in an embodiment of the present invention.
[0200] The formula for fusing the image features and the radar features to determine the target features is as follows:
[0201] $F_{fusion} = \sigma(W_1 \tau(W_0(\mathrm{MaxPool}(F_{image}))) + W_1 \tau(W_0(\mathrm{MaxPool}(F_{radar}))))$
[0202] Where $F_{fusion}$ represents the target feature after fusion, σ represents the sigmoid function, τ represents the ReLU function, and $W_1$ and $W_0$ represent the weight parameters.
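The fusion formula can be sketched as follows. The sketch assumes that MaxPool is a global max pooling over the spatial dimensions, that both feature maps share the same channel count C, and that $W_0$ and $W_1$ are shared between the two branches as the formula suggests. Function names and toy weights are hypothetical.

```python
import math

def global_max_pool(f):
    """Reduce a [C][H][W] feature map to a length-C vector (assumed pooling)."""
    return [max(map(max, ch)) for ch in f]

def branch(f, w0, w1):
    """W1 * ReLU(W0 * MaxPool(F)) for one sensor."""
    pooled = global_max_pool(f)
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def fuse(f_image, f_radar, w0, w1):
    """F_fusion = sigmoid(branch(F_image) + branch(F_radar))."""
    a, b = branch(f_image, w0, w1), branch(f_radar, w0, w1)
    return [1.0 / (1.0 + math.exp(-(p + q))) for p, q in zip(a, b)]

f_image = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 2.0], [0.0, 0.0]]]
f_radar = [[[0.0]], [[1.0]]]        # spatial sizes may differ after global pooling
f_fusion = fuse(f_image, f_radar, w0=[[1.0, 1.0]], w1=[[1.0], [-1.0]])
```

Because the two branches are summed before the sigmoid, a strong response in either sensor can raise the fused value for a channel.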
[0203] Step 5: The control module judges the fused sensor information. If obstacle avoidance is required, the control command is transmitted synchronously to the digital twin platform. If obstacle avoidance is not required, the control command is not transmitted.
[0204] Step 6: After receiving the obstacle avoidance control command sent by the digital twin platform, the handling robot actively avoids obstacles.
[0205] Step 7: After avoiding the obstacles, the transport robot continues to move forward, and the process ends.
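The decision in Step 5 can be illustrated with the sketch below. The criterion is not specified in this embodiment, so a distance threshold is assumed purely for illustration; the detection fields (`label`, `distance_m`), the threshold value, and the command format are all hypothetical.

```python
def control_step(detections, safe_distance_m=1.5):
    """Return an avoidance command if any detected obstacle is too close, else None.

    The distance-threshold rule is an assumption used for illustration.
    """
    near = [d for d in detections if d["distance_m"] < safe_distance_m]
    if not near:
        return None                                   # no command is transmitted
    closest = min(near, key=lambda d: d["distance_m"])
    return {"action": "avoid", "target": closest["label"]}

clear_path = control_step([{"label": "pallet", "distance_m": 3.0}])
blocked = control_step([{"label": "pallet", "distance_m": 3.0},
                        {"label": "cone", "distance_m": 0.8}])
```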
[0206] The 5G smart factory small target detection system for digital twins provided in this invention can serve as an effective supplement to obstacle avoidance for handling robots in smart factories. When the handling operation environment becomes harsh, the robot itself cannot judge obstacle information in a timely and effective manner. The digital twin platform can effectively integrate information from cameras and millimeter-wave radar by constructing operation information synchronized with the real scene and combining it with the high-performance small target detection capability of a remote server. At the same time, it can solve the problem of hardware consumption at the edge of the handling robot. The fusion of a large amount of data also improves the performance of the perception algorithm, providing the digital twin platform with more accurate judgment capabilities and providing a favorable guarantee for the safe and stable operation of the handling robot.
[0207] The following describes the small target detection device for 5G smart factories based on digital twins provided by the present invention. The small target detection device for 5G smart factories based on digital twins described below can be referred to in correspondence with the small target detection method for 5G smart factories based on digital twins described above.
[0208] Figure 7 is a schematic diagram of the structure of the 5G smart factory small target detection device for digital twins provided in an embodiment of the present invention. As shown in Figure 7, the 5G smart factory small target detection device for digital twins provided in this embodiment of the invention includes:
[0209] The receiving module 710 is used to receive image data and radar data sent by the transport robot; the image data is image data of the front of the transport robot; the radar data is radar data of obstacles around the transport robot.
[0210] The first determining module 720 is used to determine image features based on the channel attention map and spatial attention map of the image data;
[0211] The second determining module 730 is used to encode the radar data and determine radar characteristics;
[0212] The fusion module 740 is used to fuse the image features and the radar features to determine the target features;
[0213] The detection module 750 is used to perform target detection based on the target features.
[0214] It should be noted that the 5G smart factory small target detection device for digital twins provided in this embodiment of the invention can implement all the method steps implemented in the above embodiment of the 5G smart factory small target detection method for digital twins, and can achieve the same technical effect. Therefore, the parts and beneficial effects that are the same as those in the method embodiment will not be described in detail here.
[0215] Optionally, it also includes: a third determining module, used for:
[0216] The image data is downsampled to determine multiple first feature maps;
[0217] The first feature map is fused based on dilated convolution and attention mechanisms to determine the second feature map;
[0218] Perform nonlocal operations on the second feature map to determine the third feature map;
[0219] The target feature map is determined based on the second feature map and the third feature map;
[0220] Based on the target feature map, the channel attention map and the spatial attention map are determined.
[0221] Optionally, the third determining module is specifically used for:
[0222] Based on pooling operations, the spatial information of the target feature map is aggregated to determine the first average pooling feature and the first max pooling feature of the target feature map.
[0223] The channel attention map is determined based on the multilayer perceptron function, the first average pooling feature, and the first max pooling feature.
[0224] Optionally, the third determining module is specifically used for:
[0225] Based on pooling operations, the channel information of the target feature map is aggregated to determine the second average pooling feature and the second max pooling feature of the target feature map.
[0226] The spatial attention map is determined by convolving the second average pooling feature and the second max pooling feature.
[0227] Optionally, the formula for determining image features based on the channel attention map and spatial attention map of the image data is as follows:
[0228]
[0229] Where $F_{image}$ represents the image features, S represents the channel attention map, and $M_s$ represents the spatial attention map; the remaining symbol in the formula represents the weight matrix.
[0230] Optionally, the formula for fusing the image features and the radar features to determine the target features is as follows:
[0231] $F_{fusion} = \sigma(W_1 \tau(W_0(\mathrm{MaxPool}(F_{image}))) + W_1 \tau(W_0(\mathrm{MaxPool}(F_{radar}))))$
[0232] Where $F_{fusion}$ represents the target feature, σ represents the sigmoid function, τ represents the ReLU function, $W_1$ and $W_0$ represent the weight parameters, $F_{image}$ represents the image features, $F_{radar}$ represents the radar features, and MaxPool represents max pooling.
[0233] Figure 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 8, the electronic device may include a processor 810, a communication interface 820, a memory 830, and a communication bus 840. The processor 810, communication interface 820, and memory 830 communicate with each other via the communication bus 840. The processor 810 can call logical instructions in the memory 830 to execute a small target detection method for a 5G smart factory based on digital twins. This method includes: receiving image data and radar data sent by a transport robot; the image data being image data in front of the transport robot; the radar data being radar data of obstacles around the transport robot; determining image features based on the channel attention map and spatial attention map of the image data; encoding the radar data to determine radar features; fusing the image features and the radar features to determine target features; and performing target detection based on the target features.
[0234] Furthermore, the logical instructions in the aforementioned memory 830 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0235] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the small target detection method for 5G smart factories oriented towards digital twins provided by the above methods. The method includes: receiving image data and radar data sent by a transport robot; the image data being image data in front of the transport robot; the radar data being radar data of obstacles around the transport robot; determining image features based on the channel attention map and spatial attention map of the image data; encoding the radar data to determine radar features; fusing the image features and the radar features to determine target features; and performing target detection based on the target features.
[0236] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the small target detection method for 5G smart factories oriented towards digital twins provided by the methods described above. This method includes: receiving image data and radar data sent by a transport robot; the image data being image data in front of the transport robot; the radar data being radar data of obstacles surrounding the transport robot; determining image features based on the channel attention map and spatial attention map of the image data; encoding the radar data to determine radar features; fusing the image features and the radar features to determine target features; and performing target detection based on the target features.
[0237] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0238] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0239] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for detecting small targets in a 5G smart factory based on digital twins, characterized in that it comprises: receiving image data and radar data sent by the transport robot, the image data being image data of the front of the transport robot and the radar data being radar data of obstacles around the transport robot; determining image features based on the channel attention map and spatial attention map of the image data; encoding the radar data to determine radar features; fusing the image features and the radar features to determine the target features; and performing target detection based on the target features; wherein, before determining image features based on the channel attention map and spatial attention map of the image data, the method further includes: downsampling the image data to determine multiple first feature maps; fusing the first feature maps based on dilated convolution and attention mechanisms to determine the second feature map; performing non-local operations on the second feature map to determine the third feature map; determining the target feature map based on the second feature map and the third feature map; and determining the channel attention map and the spatial attention map based on the target feature map; wherein determining the channel attention map based on the target feature map includes: aggregating the spatial information of the target feature map based on pooling operations to determine the first average pooling feature and the first max pooling feature of the target feature map; and determining the channel attention map based on the multilayer perceptron function, the first average pooling feature, and the first max pooling feature; and wherein determining the spatial attention map based on the target feature map includes: aggregating the channel information of the target feature map based on pooling operations to determine the second average pooling feature and the second max pooling feature of the target feature map; and determining the spatial attention map by convolving the second average pooling feature and the second max pooling feature.
2. The method for small target detection in a 5G smart factory based on digital twins as described in claim 1, characterized in that the formula for determining image features based on the channel attention map and spatial attention map of the image data is as follows: ; where $F_{image}$ represents the image features, S represents the channel attention map, $M_s$ represents the spatial attention map, and the remaining symbol represents the weight matrix.
3. The method for small target detection in a 5G smart factory based on digital twins as described in claim 1, characterized in that the formula for fusing the image features and the radar features to determine the target features is as follows: $F_{fusion} = \sigma(W_1 \tau(W_0(\mathrm{MaxPool}(F_{image}))) + W_1 \tau(W_0(\mathrm{MaxPool}(F_{radar}))))$; where $F_{fusion}$ represents the target features, σ represents the sigmoid function, τ represents the ReLU function, $W_1$ and $W_0$ represent the weight parameters, $F_{image}$ represents the image features, $F_{radar}$ represents the radar features, and MaxPool represents max pooling.
4. A small target detection device for 5G smart factories based on digital twins, characterized in that it comprises: a receiving module, used to receive image data and radar data sent by the transport robot, the image data being image data of the front of the transport robot and the radar data being radar data of obstacles around the transport robot; a first determining module, used to determine image features based on the channel attention map and spatial attention map of the image data; a second determining module, used to encode the radar data and determine radar features; a fusion module, used to fuse the image features and the radar features to determine the target features; and a detection module, used to perform target detection based on the target features; wherein, before determining image features based on the channel attention map and spatial attention map of the image data, the first determining module is further configured to: downsample the image data to determine multiple first feature maps; fuse the first feature maps based on dilated convolution and attention mechanisms to determine the second feature map; perform non-local operations on the second feature map to determine the third feature map; determine the target feature map based on the second feature map and the third feature map; determine the channel attention map and the spatial attention map based on the target feature map; aggregate the spatial information of the target feature map based on pooling operations to determine the first average pooling feature and the first max pooling feature of the target feature map; determine the channel attention map based on the multilayer perceptron function, the first average pooling feature, and the first max pooling feature; aggregate the channel information of the target feature map based on pooling operations to determine the second average pooling feature and the second max pooling feature of the target feature map; and determine the spatial attention map by convolving the second average pooling feature and the second max pooling feature.
5. An electronic device comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that, when the processor executes the program, it implements the 5G smart factory small target detection method for digital twins as described in any one of claims 1 to 3.
6. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by the processor, it implements the 5G smart factory small target detection method for digital twins as described in any one of claims 1 to 3.
7. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by the processor, it implements the 5G smart factory small target detection method for digital twins as described in any one of claims 1 to 3.