Communication tower-mounted equipment classification method and system based on deep learning

By using a deep learning-based YOLOv7-VSZ network model, combined with a multi-attention mechanism and an efficient aggregation network ELAN, the problem of accurate positioning and identification of communication tower-mounted equipment was solved, achieving high-precision equipment classification and location detection.

CN116091949B · Active · Publication Date: 2026-06-26 · WUHAN SIZHONG SPACE INFORMATION TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: WUHAN SIZHONG SPACE INFORMATION TECH CO LTD
Filing Date: 2023-02-01
Publication Date: 2026-06-26

Application Information

Patent Timeline

01 Feb 2023: Application

26 Jun 2026: Publication (CN116091949B)

IPC: G06V20/17; G06V10/774; G06V10/764; G06V10/82; G06N3/08; G06N3/0464

CPC: G06V20/17; G06V10/774; G06V10/764; G06V10/82; G06N3/08; Y02T10/40

AI Tagging

Technology Topics

Data set Simulation




AI Technical Summary

Technical Problem

Existing technologies struggle to accurately identify and locate the diverse equipment mounted on communication towers in complex and ever-changing electromagnetic radiation environments, and manual identification is both dangerous and time-consuming.

Method used

A YOLOv7-VSZ network model based on deep learning is adopted, combined with a multi-attention mechanism and an efficient aggregation network ELAN. High-precision centimeter-level images of communication towers are acquired by UAVs, preprocessed and trained to generate a YOLOv7-VSZ network model, which is used to identify and locate communication tower mounted equipment.

Benefits of technology

It improves feature extraction capability and recognition accuracy in complex backgrounds, and enables precise classification and localization of communication tower-mounted equipment, reducing the danger and time consumption of manual identification.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a deep learning-based method and system for classifying communication tower-mounted equipment. The method comprises the following steps: acquiring the UAV's height data and attitude data at the moment each picture of a communication tower is shot, preprocessing the collected pictures, and constructing a data set; in calculating the loss function of the head module, taking into account the height and angle at which the UAV shoots the picture and the condition of the communication tower-mounted equipment, and introducing a multi-attention mechanism module in the predict module, to generate a YOLOv7-VSZ network model; inputting the data set into the YOLOv7-VSZ network model for classification training on communication tower-mounted equipment, and retaining the optimal model; and inputting the picture to be tested into the trained YOLOv7-VSZ network model, which outputs the classification result for the communication tower-mounted equipment. The YOLOv7-VSZ network model has high recognition accuracy and can identify target equipment at different heights, shooting angles, and mounted-equipment conditions.


Description

Technical Field

[0001] This invention belongs to the field of target detection technology, and specifically relates to a method and system for classifying communication tower-mounted equipment based on deep learning.

Background Technology

[0002] The equipment mounted on communication towers is multi-layered, diverse in type, and uses inconsistent communication frequencies. Coupled with the complex and variable electromagnetic radiation environment and the wide variety of communication equipment, identifying, classifying, and applying this equipment presents numerous challenges. Accurately and efficiently extracting the necessary equipment characteristics from the actual environment and determining the equipment type and location is a crucial direction in the current classification and identification of communication tower-mounted equipment. Current methods for identifying communication tower equipment mainly focus on signal identification and equipment image interpretation. Manually inspecting communication equipment on the tower presents problems such as danger and excessive time consumption.

[0003] Target classification and detection, a fundamental task in computer vision, has achieved remarkable results in numerous scenarios in recent years. However, current methods still have many shortcomings in identifying and classifying equipment mounted on power transmission towers in real-world scenarios, and in most cases, feature extraction and identification can only be performed on a single communication device. For example, CN108037133A discloses an intelligent identification method and system for power equipment defects based on UAV inspection images, in which typical components on power transmission towers are automatically located and identified by collecting images of transmission lines with UAVs and extracting image features. Although the principle can be applied to the identification of equipment mounted on communication towers, such equipment is denser and more diverse than that mounted on power transmission towers. Because of inconsistent UAV shooting angles, occlusion between communication devices, and inconsistent positions and orientations among communication devices, such image recognition methods suffer from problems such as unclear data features, making it difficult to accurately identify the equipment mounted on communication towers and to accurately locate it.

Summary of the Invention

[0004] In view of this, the present invention proposes a method and system for classifying communication tower mounted equipment based on deep learning, which is used to solve the problem that existing target detection methods are inaccurate in identifying mounted equipment on communication towers.

[0005] In a first aspect, this invention discloses a method for classifying communication tower-mounted equipment based on deep learning, the method comprising:

[0006] Acquire the drone's altitude and attitude data at the moment it photographs the communication tower, preprocess the acquired images, and construct a dataset;

[0007] In calculating the loss function of the head module, take into account the altitude and angle at which the drone takes the pictures and the condition of the communication tower-mounted equipment, and introduce a multi-attention mechanism module in the predict module, to generate the YOLOv7-VSZ network model;

[0008] Input the dataset into the yolov7-vsz network model for training the classification of communication tower mounted equipment, and retain the optimal model;

[0009] Input the image to be tested into the trained yolov7-vsz network model, and output the recognition result of the communication tower mounted equipment.

[0010] Based on the above technical solutions, preferably, the specific structure of the yolov7-vsz network model includes:

[0011] The backbone module employs module-level reparameterization and uses the efficient aggregation network ELAN, which consists of an input layer, a convolutional layer, a first ELAN layer, a max pooling layer, a second ELAN layer, a max pooling layer, a third ELAN layer, a max pooling layer, and a fourth ELAN layer connected in sequence.

[0012] The head module employs auxiliary head training and corresponding positive and negative sample matching strategies to perform high-order information space interaction fusion and multi-branch feature extraction on the output results of different ELAN layers of the backbone module, outputting feature maps of three different sizes.

[0013] The predict module employs an attention mechanism to extract important features from feature maps of different sizes output by the head module in complex backgrounds.

[0014] Based on the above technical solutions, preferably, the processing procedure of the head module specifically includes:

[0015] The output of the fourth ELAN layer passes through the SPPCSPC layer, a convolutional layer, and an upsampling layer in sequence, and is then concatenated, through the first connection layer, with the output of the third ELAN layer after that output has passed through a convolutional layer.

[0016] The output of the first connection layer passes through the first CH_Block layer, a convolutional layer, and an upsampling layer in sequence, and is then concatenated, through the second connection layer, with the convolved output of the first ELAN layer and the convolved output of the second ELAN layer.

[0017] The output of the second connection layer is processed by the second CH_Block layer, whose output is the first feature map; that output then passes through a convolutional layer and is concatenated with the output of the first CH_Block layer through the third connection layer;

[0018] The output of the third connection layer is processed by the third CH_Block layer, whose output is the second feature map; that output then passes through a convolutional layer and is concatenated with the output of the SPPCSPC layer through the fourth connection layer;

[0019] The output of the fourth connection layer is processed by the fourth CH_Block layer, which outputs the third feature map.

[0020] Based on the above technical solutions, preferably, the processing procedure of the predict module specifically includes:

[0021] The first feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer, and then the first prediction result is output.

[0022] The second feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer before the second prediction result is output.

[0023] The third feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer before the third prediction result is output.

[0024] The present invention has the following advantages over the prior art:

[0025] 1) In calculating the loss function of the head module, this invention takes into account actual shooting conditions, namely the altitude and angle at which the drone takes the pictures and the condition of the communication tower-mounted equipment. A multi-attention mechanism module is introduced in the predict module to generate a YOLOv7-VSZ network model, which can perform high-order information space interaction fusion and multi-branch feature extraction. The multi-attention mechanism improves the model's ability to extract deep and important features and keeps the extracted feature maps at high resolution during feature extraction, thereby improving the model's feature extraction ability in complex backgrounds and its recognition accuracy.

[0026] 2) This invention introduces multiple CH_Block layers in the head module. Each CH_Block layer combines Conv layers with the HorNet network and uses the gated convolution and recursive design of the recursive gated convolution gnConv to achieve high-order spatial interaction. The CH_Block layers are highly flexible and can be easily combined with Conv layers to perform extended interactions at different levels, increase network depth and feature interaction capability, and accelerate network learning.

Attached Figure Description

[0027] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0028] Figure 1 is a flowchart of the deep learning-based classification method for communication tower-mounted equipment according to the present invention;

[0029] Figure 2 is a schematic diagram of the YOLOv7-VSZ network model;

[0030] Figure 3 is a schematic diagram of the ELAN network structure;

[0031] Figure 4 is a schematic diagram of the network structure of CH_Block;

[0032] Figure 5 is a schematic diagram of the auxiliary head structure of the head module.

Detailed Implementation

[0033] The technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0034] Referring to Figure 1, the present invention proposes a deep learning-based classification method for communication tower-mounted equipment, the method comprising:

[0035] S1. Obtain the drone's altitude and attitude data at the moment it photographs the communication tower, preprocess the collected images, and construct a dataset.

[0036] Step S1 specifically includes the following sub-steps:

[0037] S11. Take high-precision centimeter-level images of communication tower-mounted equipment using a drone, and record the drone's altitude and attitude data during the shooting. The shooting angle can be obtained through the attitude data.

[0038] S12. Pre-classify the collected images, specifically grouping data from the same communication device at the same surround height and the same tilt shooting angle into one category.

[0039] S13. Label the pre-classified images and generate labeled base maps through manual labeling.

[0040] S14. Construct an image cropping and restoration model, uniformly crop the labeled base map, automatically generate cropped labeled images based on the labeled base map, and produce training samples of standard size without loss of pixel accuracy.

[0041] After image cropping, annotations are automatically extracted and further manually verified to obtain training samples with different shooting heights and shooting angles.

[0042] S15. Statistically analyze the proportion of different types of training samples for various communication tower mounted equipment under different height and angle conditions. Based on the statistical results, filter the samples and construct a class-balanced classification training dataset.

[0043] Step S1 is the data acquisition and preprocessing process for communication tower mounted equipment. A series of preprocessing methods are used to generate a dataset with standard size, high precision, and balanced type proportions under different heights and angles for model training. This enables the model to support input of large-scale, high-precision centimeter-level images, and retains the feature information of communication tower mounted equipment without loss, which is beneficial to improving classification accuracy and speed.
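The statistics and filtering of step S15 can be illustrated with a minimal sketch. The patent does not state the record layout, the bin widths, or the filtering rule, so the field names (`label`, `height_m`, `tilt_deg`), the 5 m and 15 degree bins, and the choice to downsample every class to the size of the rarest class are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratum_key(sample, height_bin_m=5.0, angle_bin_deg=15.0):
    """Group key: (equipment class, height bin, tilt-angle bin). Bin widths are hypothetical."""
    return (
        sample["label"],
        int(sample["height_m"] // height_bin_m),
        int(sample["tilt_deg"] // angle_bin_deg),
    )


def stratum_counts(samples):
    """Count training samples per (class, height bin, angle bin) stratum."""
    return Counter(stratum_key(s) for s in samples)


def class_proportions(samples):
    """Share of each equipment class in the sample list."""
    counts = Counter(s["label"] for s in samples)
    return {label: n / len(samples) for label, n in counts.items()}


def balance_by_class(samples, seed=0):
    """Downsample every class to the size of the rarest class (one possible filter)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s in samples:
        by_class[s["label"]].append(s)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], target))
    return balanced
```

Inspecting `stratum_counts` before filtering shows which height and angle conditions are under-represented; a stricter variant could balance per stratum rather than per class.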

[0044] S2. In calculating the loss function of the head module, take into account the altitude and angle at which the drone takes the pictures and the condition of the communication tower-mounted equipment, and introduce a multi-attention mechanism module in the predict module, to generate the YOLOv7-VSZ network model.

[0045] As shown in Figure 2, the YOLOv7-VSZ network model includes a backbone module, a head module, and a predict module. Their specific structure and functions are as follows:

[0046] The backbone module employs module-level reparameterization, decomposing a module into reusable forms, and uses the efficient aggregation network ELAN. As shown in Figure 2, the backbone module includes an input layer, a convolutional layer, a first ELAN layer, a max pooling layer, a second ELAN layer, a max pooling layer, a third ELAN layer, a max pooling layer, and a fourth ELAN layer, connected in sequence.

[0047] The head module employs auxiliary head training and corresponding positive and negative sample matching strategies to perform high-order information space interaction fusion and multi-branch feature extraction on the output results of different ELAN layers of the backbone module, outputting feature maps of three different sizes.

[0048] The head module uses an auxiliary head for training, together with a corresponding positive/negative sample matching strategy. Based on the adjusted auxiliary head structure, performance can be improved through multi-branching, and inference speed can be accelerated through structural reparameterization. The adjusted auxiliary head structure is shown in Figure 5.

[0049] As shown in Figure 2, the processing procedure of the head module specifically includes:

[0050] The output of the fourth ELAN layer passes through the SPPCSPC layer, a convolutional layer (Conv), and an upsampling layer (UPsample) in sequence, and is then concatenated, through the first connection layer (first Concat), with the output of the third ELAN layer after that output has passed through a convolutional layer.

[0051] The output of the first connection layer passes through the first CH_Block layer, a convolutional layer, and an upsampling layer in sequence, and is then concatenated, through the second connection layer (second Concat), with the convolved output of the first ELAN layer and the convolved output of the second ELAN layer.

[0052] The output of the second connection layer is processed by the second CH_Block layer, whose output is the first feature map; that output then passes through a convolutional layer and is concatenated with the output of the first CH_Block layer through the third connection layer (third Concat);

[0053] The output of the third connection layer is processed by the third CH_Block layer, whose output is the second feature map; that output then passes through a convolutional layer and is concatenated with the output of the SPPCSPC layer through the fourth connection layer (fourth Concat);

[0054] The connection results of the fourth connection layer are processed by the fourth CH_Block layer and output as the third feature map.
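The wiring of the head module described above can be written down as a small dependency graph, which makes it easy to check that every layer's inputs are available before it runs. This is only a sketch of the connectivity: the node names are hypothetical, the pooling layer is named SPPCSPC as in claim 3, and no channel counts or feature map sizes are modelled because the text does not give them.

```python
from graphlib import TopologicalSorter

# Each node maps to the nodes whose outputs it consumes (hypothetical node names).
HEAD_GRAPH = {
    "sppcspc": ["elan4"],
    "conv_a": ["sppcspc"],
    "up_a": ["conv_a"],
    "conv_e3": ["elan3"],
    "concat1": ["up_a", "conv_e3"],
    "ch_block1": ["concat1"],
    "conv_b": ["ch_block1"],
    "up_b": ["conv_b"],
    "conv_e1": ["elan1"],
    "conv_e2": ["elan2"],
    "concat2": ["up_b", "conv_e1", "conv_e2"],
    "ch_block2": ["concat2"],
    "conv_c": ["ch_block2"],
    "concat3": ["conv_c", "ch_block1"],
    "ch_block3": ["concat3"],
    "conv_d": ["ch_block3"],
    "concat4": ["conv_d", "sppcspc"],
    "ch_block4": ["concat4"],
}

# The three feature maps handed to the predict module.
FEATURE_MAPS = {
    "feature_map_1": "ch_block2",
    "feature_map_2": "ch_block3",
    "feature_map_3": "ch_block4",
}


def execution_order(graph):
    """Return one valid order in which the layers can be evaluated."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in execution_order(HEAD_GRAPH):
        print(name)
```

`TopologicalSorter` raises `CycleError` if the wiring contains a loop, so the sketch also serves as a quick consistency check on the described connections.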

[0055] The first, second, third, and fourth ELAN layers have the same structure, each being an efficient aggregation network ELAN; Figure 3 shows the structure of each ELAN layer. ELAN propagates input and gradient information through multiple layers, facilitating information exchange between layers. By controlling connection paths of varying lengths, the ELAN structure enables the network to learn and converge effectively.

[0056] The first, second, third, and fourth CH_Block layers have the same structure, each being a combination of Conv layers and the HorNet network, as shown in Figure 4. The gated convolution and recursive design of the recursive gated convolution gnConv enable high-order spatial interactions, which favor high-order feature interaction, deepen the network, and accelerate the network learning process.

[0057] In calculating the loss function of the head module, the height and angle at which the drone takes the photos, as well as the condition of the communication tower-mounted equipment, are taken into account. In practice, shooting angles are inconsistent from one drone shot to another, communication devices occlude one another, and the positions and orientations of multiple communication devices are inconsistent. By constraining, during training, the overlap area between the predicted bounding box and the ground truth bounding box of the target mounted equipment, the distance between their center points, and their aspect ratios, the YOLOv7-VSZ network model becomes able to recognize equipment in images taken at different heights and angles, with occlusion between devices, or with inconsistent positions and orientations among communication devices.

[0058] The loss function $L_{CIOU}$ of the head module is calculated as follows:

[0059] $$L_{CIOU} = 1 - IOU(A,B) + \frac{D_1^2}{D_2^2} + \frac{v^2}{\left(1 - IOU(A,B)\right) + v}$$

[0060] Where $v$ represents the parameter for aspect ratio consistency:

[0061] $$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$

[0062] Where A represents the predicted bounding box and B represents the ground truth bounding box; $IOU(A,B)$ represents the overlap between A and B; $D_1$ is the distance between the center points of the predicted bounding box A and the ground truth bounding box B; $D_2$ is the diagonal length of the smallest box enclosing the predicted bounding box A and the ground truth bounding box B; $w^{gt}$ and $h^{gt}$ are the width and height of the predicted bounding box A; and $w$ and $h$ are the width and height of the ground truth bounding box B.

[0063] This invention incorporates the height and angle of photos taken by drones, as well as the condition of equipment mounted on communication towers, into the loss function calculation, enabling the yolov7-vsz network model to identify target equipment at different heights, shooting angles, and equipment conditions, thereby improving the model's predictive ability.
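As an illustration of the three constraints named above (overlap, center distance, aspect ratio), the following is a minimal sketch of a CIoU-style loss for a single pair of boxes. It assumes the standard CIoU form, with the aspect ratio term written as v²/((1 − IOU) + v), boxes given as (x1, y1, x2, y2) corner coordinates, and strictly positive widths and heights; none of these conventions is stated in the text.

```python
import math


def ciou_loss(pred, truth):
    """CIoU-style loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = truth
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap term IOU(A, B).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / union

    # D1 squared: squared distance between the two box centers.
    d1_sq = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    # D2 squared: squared diagonal of the smallest box enclosing both boxes.
    d2_sq = (max(px2, tx2) - min(px1, tx1)) ** 2 \
        + (max(py2, ty2) - min(py1, ty1)) ** 2

    # Aspect ratio consistency v (symmetric in the two boxes).
    v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    denom = (1 - iou) + v
    aspect_term = v * v / denom if denom > 0 else 0.0

    return 1 - iou + d1_sq / d2_sq + aspect_term
```

A perfect prediction gives a loss of zero, and two disjoint boxes of the same shape are penalized only through the overlap and center distance terms, which is what lets the loss keep providing a training signal when boxes do not overlap.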

[0064] The predict module employs an attention mechanism to extract important features from feature maps of different sizes output by the head module in complex backgrounds.

[0065] As shown in Figure 2, the processing steps of the predict module specifically include:

[0066] The first feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer, and then the first prediction result is output.

[0067] The second feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer before the second prediction result is output.

[0068] The third feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer before the third prediction result is output.

[0069] Without introducing additional parameters, the predict module uses the Attention layer to assess the importance of each neuron from the linear separability of neurons within the same channel. This allows the model to ignore irrelevant information and focus on important information, improving its ability to extract deep, important features. It keeps the feature maps at high resolution during feature extraction and extracts multi-scale features effectively through repeated feature fusion at different scales, thereby improving the model's feature extraction capability in complex backgrounds while training relatively quickly. Combined with convolutional operations, it allows the size of the receptive field to adjust adaptively based on multiple scales of input information.
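The text does not name the attention layer it uses. A parameter-free attention that scores neurons by their separability from the other neurons in the same channel matches the published SimAM formulation, so the sketch below assumes that form; the function name, the regularizer `lam`, and the plain nested-list representation of a channel are illustrative choices only. The channel must contain at least two values.

```python
import math


def simam_channel(channel, lam=1e-4):
    """Reweight one H x W channel (a list of rows) with a SimAM-style importance score."""
    values = [x for row in channel for x in row]
    n = len(values) - 1  # number of "other" neurons in the channel
    mu = sum(values) / len(values)
    sq_dev = [[(x - mu) ** 2 for x in row] for row in channel]
    var = sum(d for row in sq_dev for d in row) / n

    out = []
    for row, dev_row in zip(channel, sq_dev):
        out_row = []
        for x, d in zip(row, dev_row):
            # Neurons that stand out from the channel mean get a higher score.
            importance = d / (4 * (var + lam)) + 0.5
            # Multiply the activation by sigmoid(importance).
            out_row.append(x / (1 + math.exp(-importance)))
        out.append(out_row)
    return out
```

No weights are learned here: the score is computed directly from channel statistics, which is what "without introducing additional parameters" would mean under this assumption.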

[0070] S3. Input the dataset into the yolov7-vsz network model for training the classification of communication tower mounted equipment, and retain the optimal model.

[0071] Hyperparameter settings were applied to the YOLOv7-VSZ network model. Centimeter-precision training samples of communication towers from the dataset were input into the backbone module. The backbone module used module-level reparameterization for feature extraction. Then, the head module performed high-order information space interaction fusion and multi-branch feature extraction, outputting feature maps of three different sizes. The predict module output the prediction results through an attention mechanism and a Conv layer. The parameters of the YOLOv7-VSZ network model were optimized through continuous training until the set training termination condition was reached, resulting in the optimal model for identification of communication tower mounted equipment.
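Retaining the optimal model during training can be sketched as a loop that keeps a copy of the best-scoring state. The termination condition is not specified in the text, so the epoch cap and the patience-based early stop below are assumptions, and `train_one_epoch` and `evaluate` are hypothetical callables standing in for the real training and validation steps.

```python
import copy


def train_and_keep_best(model_state, train_one_epoch, evaluate,
                        max_epochs=300, patience=30):
    """Train until a stop condition is met and return the best state seen.

    train_one_epoch(state) -> new state   (hypothetical callable)
    evaluate(state) -> float, higher is better, e.g. a validation score
    """
    best_score = float("-inf")
    best_state = copy.deepcopy(model_state)
    epochs_without_gain = 0

    for _ in range(max_epochs):
        model_state = train_one_epoch(model_state)
        score = evaluate(model_state)
        if score > best_score:
            # New optimum: keep an independent copy of the parameters.
            best_score = score
            best_state = copy.deepcopy(model_state)
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # assumed termination condition

    return best_state, best_score
```

Keeping a deep copy matters because later epochs keep modifying the live parameters; without it the "best" model would silently track the latest one.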

[0072] S4. Input the image to be tested into the trained yolov7-vsz network model and output the recognition result of the communication tower mounted equipment.

[0073] Before the image to be tested is input into the model, it is cropped using the same preprocessing method as in step S1. After the cropped images are input into the model for recognition, the output images are stitched back together and the recognized types are fused to obtain the final output result.
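The crop, recognize, and stitch flow can be sketched as follows. The text does not say how crops are laid out or how results from neighboring crops are fused, so the fixed tile size with optional overlap, the detection tuple layout `(label, score, x1, y1, x2, y2)`, and the class-wise suppression of overlapping duplicates are all assumptions for illustration.

```python
def tile_origins(width, height, tile, overlap=0):
    """Top-left corners of square tiles covering a width x height image."""
    step = tile - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than tile")

    def axis(length):
        starts = list(range(0, max(length - tile, 0) + 1, step))
        if starts[-1] + tile < length:
            starts.append(length - tile)  # last tile is shifted to reach the edge
        return starts

    return [(x, y) for y in axis(height) for x in axis(width)]


def to_full_image(detections, origin):
    """Shift tile-local boxes (label, score, x1, y1, x2, y2) into image coordinates."""
    ox, oy = origin
    return [(label, score, x1 + ox, y1 + oy, x2 + ox, y2 + oy)
            for label, score, x1, y1, x2, y2 in detections]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_detections(detections, iou_threshold=0.5):
    """Keep the highest-scoring box among same-class boxes that overlap heavily."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(det[0] != k[0] or iou(det[2:], k[2:]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

In use, each tile from `tile_origins` would be passed to the trained model, its detections shifted with `to_full_image`, and the combined list passed once through `merge_detections`.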

[0074] This invention designs a YOLOv7-VSZ network model based on high-precision centimeter-level images of communication tower-mounted equipment acquired by UAVs. The model is trained on high-precision centimeter-level labeled images, thus realizing a method for accurate classification of communication tower-mounted equipment.

[0075] Corresponding to the above method embodiments, the present invention also proposes a deep learning-based classification system for communication tower-mounted equipment, the system comprising:

[0076] Data preprocessing module: used to acquire the drone's altitude and attitude data at the moment it photographs the communication tower, preprocess the acquired images, and build a dataset;

[0077] Model building module: used to take into account, in calculating the loss function of the head module, the altitude and angle at which the drone takes the pictures and the condition of the communication tower-mounted equipment, and to introduce a multi-attention mechanism module in the predict module, to generate the YOLOv7-VSZ network model;

[0078] Model training module: Used to input datasets into the yolov7-vsz network model for training the classification of communication tower mounted equipment, and retain the optimal model;

[0079] Model application module: This module takes the image to be tested as input into the trained YOLOv7-VSZ network model and outputs the recognition results of communication tower mounted equipment.

[0080] The above system embodiments correspond one-to-one to the method embodiments. The system embodiments are described only briefly; for details, please refer to the method embodiments.

[0081] The present invention also discloses an electronic device, comprising: at least one processor, at least one memory, a communication interface, and a bus; wherein the processor, memory, and communication interface communicate with each other through the bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to implement the aforementioned method of the present invention.

[0082] The present invention also discloses a computer-readable storage medium that stores computer instructions, which cause the computer to implement all or part of the steps of the method described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0083] The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, meaning they can be distributed across multiple network units. Those skilled in the art can select some or all of the modules to achieve the purpose of this embodiment without any inventive effort, based on actual needs.

[0084] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A classification method for communication tower-mounted equipment based on deep learning, characterized in that the method includes: acquiring the drone's altitude and attitude data at the moment it photographs the communication tower, preprocessing the acquired images, and constructing a dataset; based on the YOLOv7 network model, taking into account, in calculating the loss function of the head module, the altitude and angle at which the drone takes the pictures and the condition of the communication tower-mounted equipment, and introducing a multi-attention mechanism module in the predict module, to generate the YOLOv7-VSZ network model; inputting the dataset into the YOLOv7-VSZ network model for training the classification of communication tower-mounted equipment, and retaining the optimal model; inputting the image to be tested into the trained YOLOv7-VSZ network model, and outputting the recognition result for the communication tower-mounted equipment; wherein the step of acquiring the drone's altitude and attitude data at the moment it photographs the communication tower, preprocessing the acquired images, and constructing the dataset includes: acquiring, by the drone, centimeter-level images of communication tower-mounted equipment and obtaining shooting information, the shooting information including the shooting angle and the drone's altitude data and attitude data at the time of shooting, the shooting angle being determined from the attitude data; pre-classifying the centimeter-level images of the communication tower-mounted equipment, with the data of the same communication equipment at the same surround height and the same tilt shooting angle grouped into one category, to obtain pre-classified images; labeling the pre-classified images to generate a labeled base map; constructing an image cropping and restoration model, and uniformly cropping the labeled base map to obtain training samples; and statistically analyzing the proportion of different types of training samples for the various communication tower-mounted equipment under different height and angle conditions, and filtering the training samples based on the statistical results to obtain the dataset.

2. The method for classifying communication tower-mounted equipment based on deep learning according to claim 1, characterized in that, The specific structure of the yolov7-vsz network model includes: The backbone module employs module-level reparameterization and uses the efficient aggregation network ELAN, which consists of an input layer, a convolutional layer, a first ELAN layer, a max pooling layer, a second ELAN layer, a max pooling layer, a third ELAN layer, a max pooling layer, and a fourth ELAN layer connected in sequence. The head module employs auxiliary head training and corresponding positive and negative sample matching strategies to perform high-order information space interaction fusion and multi-branch feature extraction on the output results of different ELAN layers of the backbone module, outputting feature maps of three different sizes. The predict module employs an attention mechanism to extract important features from feature maps of different sizes output by the head module in complex backgrounds.

3. The deep learning-based classification method for communication tower mounting equipment according to claim 2, characterized in that, The processing procedure of the head module specifically includes: The fourth ELAN layer passes through the SPPCSPC layer, convolutional layer, and upsampling layer in sequence, and is connected to the result of the third ELAN layer after the convolutional layer is processed by the first connection layer. The connection result of the first connection layer is processed sequentially through the first CH_Block layer, the convolutional layer, and the upsampling layer. Then, it is connected with the processing result of the first ELAN layer after the convolutional layer and the processing result of the second ELAN layer after the convolutional layer through the second connection layer. The connection result of the second connection layer passes through the second CH_Block layer and the convolutional layer in sequence, and is then connected with the processing result of the first CH_Block layer through the third connection layer; after processing by the second CH_Block layer, the first feature map is output; The connection result of the third connection layer is sequentially passed through the third CH_Block layer and the convolutional layer, and then connected with the processing result of the SPPCSPC layer through the fourth connection layer; after processing by the third CH_Block layer, the second feature map is output; The connection result of the fourth connection layer is processed by the fourth CH_Block layer and outputs the third feature map; The first CH_Block layer, the second CH_Block layer, the third CH_Block layer, and the fourth CH_Block layer have the same structure, which is a combination of the Conv layer and the HorNet network.

4. The deep learning-based classification method for communication tower mounting equipment according to claim 3, characterized in that, The specific processing steps of the predict module include: The first feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer, and then the first prediction result is output. The second feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer, and then the second prediction result is output. The third feature map is processed sequentially through the Attention layer, convolutional layer, and detect layer before the third prediction result is output.