A network model for small target detection and application
By letting text and image features interact in the GLIP network model and adding branches from shallow features to deep features, the invention addresses the accuracy and robustness problems of small tennis ball detection and achieves higher detection precision.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG GONGSHANG UNIVERSITY
- Filing Date
- 2024-01-02
- Publication Date
- 2026-06-12
AI Technical Summary
Existing image and text multimodal models perform poorly in detecting small targets, such as tennis balls. They struggle to accurately identify small targets under conditions of high-speed motion and deformation, and are greatly affected by lighting, occlusion, and size changes.
The invention adopts a network model based on the GLIP structure, in which text and image features interact through a cross-modal multi-head attention mechanism. A branch from shallow features to deep features is added to the multimodal fusion module, yielding a multimodal deep fusion module, deep_fusion, that strengthens feature fusion.
The accuracy of small tennis ball detection was improved from 72.8% to 75.4%, reducing the interference of high-speed rotation and deformation on detection and improving the robustness and accuracy of the model.
Figure CN117829210B_ABST
Description
Technical Field
[0001] This invention belongs to the field of object detection in computer vision, and relates to a network model and its application for small object detection.
Background Technology
[0002] In recent years, with social development, people have paid increasing attention to competitive sports. In tennis matches, the ability to monitor the tennis ball in real-time throughout the entire match is particularly important. Continuously monitoring the ball in real-time during the game allows for timely feedback and assistance to referees in their rulings, increasing the fairness of the match. It also enables timely post-match analysis, helping players improve their training efficiency.
[0003] Current methods for identifying tennis balls using human visual inspection or cameras are subjective and inefficient. Most ball detection is performed at normal speeds; however, tennis balls are characterized by high speed and small size, and may exhibit issues such as motion blur and deformation. The appearance of a tennis ball under high-speed conditions differs significantly from that under stationary conditions, posing a considerable challenge to research on tennis ball target detection.
[0004] On the other hand, tennis balls are small targets relative to the entire tennis court. Small targets usually do not have very complete features and have low resolution, which means that there are fewer features to learn and it is difficult to extract features well. In addition, they are extremely susceptible to the influence and interference of the surrounding environment, such as lighting, occlusion and size changes, which makes it difficult for the model to accurately locate and identify small targets, further increasing the difficulty of small target detection.
[0005] Multimodal models based on images and text have become a research hotspot in recent years and are increasingly applied to object detection. Multimodal object detection identifies and locates objects using information from multiple modalities (such as vision, hearing, and touch). It maps input data from the different modalities into a common feature space and fuses them there, which improves the accuracy and robustness of object detection and gives the approach broad application prospects.
[0006] Multimodal object detection models, spearheaded by GLIP, have become a new paradigm for multimodal object detection. However, because tennis balls are small, easily deformed, and fast-moving, the detection results obtained with image and text multimodal models alone are unsatisfactory in practical applications, and there is still much room for improvement.
Summary of the Invention
[0007] This invention addresses the shortcomings of existing technologies by providing a network model and its application for small target detection.
[0008] In a first aspect, the present invention provides a network model for small object detection. This network model is based on the GLIP structure, replaces the original text query method with a multimodal query method, and utilizes a cross-modal multi-head attention mechanism to enable visual guidance in the text features, allowing the text features to perceive visual details.
[0009] Secondly, this invention provides an application of a network model for small target detection in tennis ball detection.
[0010] The beneficial effects of this invention are:
[0011] The GLIP-based architecture proposed in this invention employs a cross-modal multi-head attention mechanism so that text and image features interact, replacing the original text-based query method with a multimodal query approach. The cross-modal multi-head attention mechanism incorporates visual guidance into the text features, enabling them to perceive visual details. As a result, the network extracts more complete and richer features, and the features become more expressive.
[0012] Meanwhile, the model proposes a new multimodal deep fusion module, deep_fusion, which replaces the original multimodal fusion structure, enhances the fusion of multimodal features, and includes a branch that adds shallow features to deep features. The deep network thereby retains more feature information about small targets, enabling the network to better learn the key features of small tennis ball targets. Because this key feature information is retained in the deep structure, the model is better suited to the accurate detection of small tennis ball targets addressed by this invention.
Attached Figure Description
[0013] To more clearly illustrate the network structure and training process of the present invention, the accompanying drawings required in the embodiments will be briefly introduced below.
[0014] Figure 1 shows the tennis ball target detection process according to an embodiment of this application.
[0015] Figure 2 is a structural diagram of the multimodal network, including the multimodal query method, in an embodiment of this application.
[0016] Figure 3 is a structural diagram of the multimodal deep fusion module, deep_fusion, in an embodiment of this application.
[0017] Figure 4 shows visualized detection results. The left side shows the GLIP detection result, and the right side shows the detection result of the network in the embodiment of this application.
Detailed Implementation
[0018] To describe the present invention in more detail, the following will provide a detailed description of the invention in conjunction with the accompanying drawings and specific embodiments.
[0019] This application addresses the problem of small tennis ball target detection by proposing a multimodal method based on images and text. It aims to solve the challenge of detecting small tennis balls during high-speed motion. This application optimizes the GLIP network structure, replacing the previous text query method with a multimodal query approach. By adding cross-modal multi-head attention, text features and image features interact, resulting in text features that incorporate guidance from image features, containing more detailed information. This allows text features to perceive visual details and better uncover the detailed information of small targets.
[0020] Meanwhile, based on the multimodal fusion structure, a multimodal deep fusion module, deep_fusion, is proposed, which adds a branch to add shallow features to deep features. This allows the small target feature information of the shallow network to be better incorporated into the deep network, which helps the network learn the small target features of the tennis ball, effectively improves the accuracy of tennis ball detection, and reduces the interference of high-speed rotating and deformed tennis balls on the model detection.
[0021] In this embodiment, the network model for small object detection is based on the GLIP network structure, which extracts features from the input image using an image encoder. The input text labels are converted into text prompts, which are then processed by a text encoder to extract features. The following improvements are made:
[0022] A cross-modal multi-head attention mechanism is added so that text and image features interact. After this mechanism, the text features incorporate guidance from the image features and contain more detailed information. This multimodal query removes the limitation of relying solely on text as the query, weaving visual and textual cues together. It allows the text features to perceive visual details and better uncover the fine details of small targets. Finally, the interacted text features are output for multimodal fusion in subsequent steps.
[0023] Furthermore, in one embodiment: the GLIP-based multimodal fusion structure employs a deep fusion structure, specifically:
[0024] In the multimodal deep fusion module, extracted text and image features are fused using a cross-modal multi-head attention mechanism. Branches are then added to the fused text and image features to incorporate shallow features into deeper features. Since shallow feature information is incorporated into deeper features, and small objects often retain more of their features in shallower layers and are easier to identify, adding branches to incorporate shallow features into deeper features in the multimodal fusion module helps the network learn small object features, allowing them to be better preserved in deeper layers and improving the small object detection performance of the multimodal model. The improved multimodal deep fusion module is named deep_fusion.
[0025] This application also discloses an improved method for detecting small tennis balls based on image and text multimodal fusion. As shown in Figure 1, the main steps of the detection workflow are as follows:
[0026] Step 1. The tennis ball detection network reads tennis ball images in real time;
[0027] Step 2. Input the tennis ball image into the network model for forward inference;
[0028] Step 3. The network determines whether there is a tennis ball in the image. If there is a tennis ball, proceed to step 4; otherwise, proceed to step 5.
[0029] Step 4. The detection system will label the tennis ball and indicate that a tennis ball has been detected in the image;
[0030] Step 5. Continue to input unread images into the network model. If there are unread images, return to step 1; otherwise, end the detection.
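The five steps above can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function name run_detection, the box format, and the stand-in detector fake_detect are all hypothetical, and a real system would call the trained network where detect is invoked.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def run_detection(
    frames: Iterable[object],
    detect: Callable[[object], Optional[Box]],
) -> List[Tuple[int, Box]]:
    """Walk through the five steps for every unread frame."""
    labeled: List[Tuple[int, Box]] = []
    for index, frame in enumerate(frames):  # Steps 1 and 5: read the next unread image
        box = detect(frame)                 # Step 2: forward inference
        if box is not None:                 # Step 3: is a tennis ball present?
            labeled.append((index, box))    # Step 4: label the detected ball
    return labeled                          # no unread images remain, so detection ends


# Stand-in detector for illustration only: it "finds" a ball in frames marked "ball".
def fake_detect(frame: object) -> Optional[Box]:
    return (10, 20, 18, 28) if frame == "ball" else None


results = run_detection(["empty", "ball", "empty", "ball"], fake_detect)
```

Here results holds one (frame index, box) pair for each frame in which the stand-in detector reported a ball.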
[0031] The embodiments of this application will be further described below with reference to the accompanying drawings.
[0032] The tennis ball detection steps using the small object detection network model described in this application are as follows:
[0033] Step 1: Shoot actual tennis matches and tennis training videos, extract the high-speed small tennis balls contained in the images frame by frame, and manually label them as a dataset.
[0034] Step 2: The network input includes the image to be detected and the corresponding text label, as shown in Figure 2.
[0035] First, the text labels are converted into text prompts, and text features are extracted using a text encoder. At the same time, the image to be detected is also input into an image encoder to extract image features.
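The conversion of text labels into a text prompt might look like the sketch below. The patent does not state the prompt template, so joining labels with ". " is an assumption modeled on the way grounding-style detectors are commonly prompted. The helper build_prompt and the second label "racket" are hypothetical.

```python
from typing import Dict, List, Tuple


def build_prompt(labels: List[str]) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join labels into one prompt and record each label's character span."""
    prompt = ""
    spans: Dict[str, Tuple[int, int]] = {}
    for label in labels:
        start = len(prompt)
        prompt += label
        # The span lets a detector map a matched phrase back to its label.
        spans[label] = (start, len(prompt))
        prompt += ". "
    return prompt.strip(), spans


prompt, spans = build_prompt(["tennis ball", "racket"])
```

The recorded spans make it possible to tell which part of the prompt a detected region was matched against before the prompt is passed to the text encoder.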
[0036] A multimodal query method is added to the existing model, as shown in Figure 3. Specifically, text features and image features interact through a cross-modal multi-head attention mechanism, so that the resulting text features carry guidance from the image features and contain more feature information. This replaces the previous text-only query method with a multimodal query method.
[0037] Compared to text, image features can provide richer cues about the target object. Meanwhile, text features have higher information density, resulting in stronger generalization ability. By using text descriptions with open-set generalization and visual samples with rich descriptive granularity as category queries, visual and textual cues are interwoven. This allows text features to perceive visual details, improving the fine-grainedness of the query, better uncovering detailed information about small targets, and enhancing the overall performance of network detection. Finally, the interactive text features are output, facilitating multimodal fusion in subsequent steps.
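The interaction described in Step 2 can be illustrated with a small NumPy sketch of multi-head cross-attention in which text tokens act as queries over image locations. The feature sizes, the random projection weights, and the final residual addition are assumptions made for illustration; the patent does not specify them, and in a trained model the weights would be learned.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, num_heads, rng):
    """Multi-head attention in which `queries` attend to `context`.

    queries: (n_q, d), context: (n_c, d).
    Returns the attended features (n_q, d) and the weights (heads, n_q, n_c).
    """
    n_q, d = queries.shape
    assert d % num_heads == 0
    d_head = d // num_heads
    # Random projections stand in for learned parameters (assumption).
    w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    def split(x):  # (n, d) -> (heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(queries @ w_q), split(context @ w_k), split(context @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n_q, n_c)
    heads = weights @ v                                            # (heads, n_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)              # concatenate the heads
    return merged @ w_o, weights


rng = np.random.default_rng(0)
text_feats = rng.standard_normal((4, 32))    # 4 text tokens (hypothetical size)
image_feats = rng.standard_normal((50, 32))  # 50 image locations (hypothetical size)
guided, weights = cross_attention(text_feats, image_feats, num_heads=8, rng=rng)
text_query = text_feats + guided  # assumed residual add: text now carries visual guidance
```

Each row of text_query mixes the original text token with a weighted summary of the image locations it attended to, which is one way text features can come to "perceive visual details".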
[0038] Step 3: Figure 3 shows the structure of the multimodal deep fusion module, deep_fusion. In the deep_fusion structure, the input image features are denoted as V0 and the input text features as T0. V0 and T0 pass through a cross-modal multi-head attention mechanism to obtain the image features V_cross0, which are fused with the text features, and the corresponding fused text features T_cross0.
[0039] The text feature T1 is obtained by adding T0 and T_cross0 features and then passing it through the text encoding module 1. Similarly, the image feature V1 is obtained by adding V0 and V_cross0 features and then passing it through the image encoding module 1.
[0040] The deep_fusion structure adds short-circuit branches that add T_cross0 to the T1 features and V_cross0 to the V1 features. Since T_cross0 and V_cross0 contain features from the shallow fusion of the text and image modalities, adding them to the deeper T1 and V1 features constitutes the branch that adds shallow features to deep features, enabling the network to better learn the key features of small tennis balls. This branch is repeated four times, adding shallow features to deep features in text and image encoding modules 1, 2, 3, and 4 respectively.
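The data flow of deep_fusion described above can be sketched in NumPy as follows. This is an illustration under stated assumptions: attend is a single-head, weight-free stand-in for the cross-modal multi-head attention, make_encoder is a toy stand-in for the text and image encoding modules, and the way the four stages are chained (each stage taking the previous stage's output after the short-circuit addition) is an assumption, since the text specifies only the first stage in detail.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attend(queries, context):
    """Single-head, weight-free stand-in for cross-modal multi-head attention."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context


def make_encoder(dim, rng):
    """Toy stand-in for an encoding module: one linear layer followed by tanh."""
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return lambda x: np.tanh(x @ w)


def deep_fusion(v0, t0, image_encoders, text_encoders):
    v, t = v0, t0
    for enc_v, enc_t in zip(image_encoders, text_encoders):
        v_cross = attend(v, t)       # V_cross: image features fused with text
        t_cross = attend(t, v)       # T_cross: text features fused with image
        v_next = enc_v(v + v_cross)  # V1 = image encoding module(V0 + V_cross0)
        t_next = enc_t(t + t_cross)  # T1 = text encoding module(T0 + T_cross0)
        v = v_next + v_cross         # short-circuit branch: shallow fused features
        t = t_next + t_cross         # are added to the deeper features
    return v, t


rng = np.random.default_rng(0)
dim = 16
v0 = rng.standard_normal((50, dim))  # 50 image locations (hypothetical size)
t0 = rng.standard_normal((4, dim))   # 4 text tokens (hypothetical size)
image_encoders = [make_encoder(dim, rng) for _ in range(4)]  # encoding modules 1 to 4
text_encoders = [make_encoder(dim, rng) for _ in range(4)]
v_out, t_out = deep_fusion(v0, t0, image_encoders, text_encoders)
```

The two lines after the encoder calls are the short-circuit branches: without them, the fused shallow features would reach later stages only after passing through an encoding module.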
[0041] During network training, the detailed information of smaller targets is often preserved more completely in the shallow layers of the network, which contain more pixel information. Fine-grained information, such as color, texture, edges, and corner details, gradually decreases as the network depth increases. Therefore, to better detect small targets, more shallow features can be retained in the deeper layers of the network. The core idea of the multimodal deep fusion module, deep_fusion, is to carry the fused shallow-layer features, and the small target information they contain, into the deep network. This helps the network learn small target features and effectively improves the accuracy of tennis ball detection.
[0042] Step 4: Input the image containing the tennis ball target into the multimodal object detection network for forward inference. Based on the result of the forward inference, draw the tennis ball detection box on the original image to obtain a visual representation of the detection result. In Figure 4, the left side shows the detection result of the GLIP network, and the right side shows the detection result of the network in the embodiment of this application.
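Drawing the detection box in Step 4 can be sketched with plain NumPy array slicing. The helper draw_box, the inclusive (x_min, y_min, x_max, y_max) box format, and the blank test frame are hypothetical; a real pipeline would take the box from the network output and typically use an imaging library for rendering.

```python
import numpy as np


def draw_box(image, box, color=(255, 0, 0), thickness=1):
    """Return a copy of an (H, W, 3) uint8 image with a rectangle outline drawn on it."""
    x0, y0, x1, y1 = box  # inclusive pixel corners (hypothetical format)
    out = image.copy()    # leave the original frame untouched
    t = thickness
    out[y0:y0 + t, x0:x1 + 1] = color          # top edge
    out[y1 - t + 1:y1 + 1, x0:x1 + 1] = color  # bottom edge
    out[y0:y1 + 1, x0:x0 + t] = color          # left edge
    out[y0:y1 + 1, x1 - t + 1:x1 + 1] = color  # right edge
    return out


frame = np.zeros((64, 64, 3), dtype=np.uint8)   # blank stand-in for a video frame
annotated = draw_box(frame, (10, 20, 18, 28))   # box as it might come from inference
```

Only the outline pixels are changed, so the ball itself stays visible inside the box in the annotated copy.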
[0043] In summary, this application addresses the challenges of high speed, small size, and susceptibility to deformation inherent in small tennis balls. Building on the GLIP model, it refines the multimodal query architecture and designs a novel multimodal deep fusion module, deep_fusion. The application first collects actual tennis ball data, inputs the data into the network to obtain detection results, and then labels the detected tennis balls.
[0044] This application proposes a novel multimodal query that replaces the original text query architecture. It allows text features to perceive visual details, improves the fineness of the query, better uncovers detailed information about small targets, and enhances the overall performance of network detection.
[0045] A multimodal deep fusion module, deep_fusion, was also proposed to replace the original multimodal fusion module, fusion. It adds a branch that adds shallow features to deep features, enabling the network to better learn the key features of small tennis balls.
[0046] With the above improvements, the average precision (AP) on the tennis dataset rose from 72.8% for the original multimodal object detection model GLIP to 75.4%. Overall, the network design effectively detects small targets, such as tennis balls, meeting the needs of practical applications.
Claims
1. A method for constructing a network model for small object detection, the network model being based on a GLIP architecture, characterized in that: The original text query method is replaced with a multimodal query method, and a cross-modal multi-head attention mechanism is used to make the text features include visual guidance, and the text features can perceive visual details. The input image features are denoted as V0, and the input text features are denoted as T0; Image features V0 and text features T0 are processed through a cross-modal multi-head attention mechanism to obtain image features V_cross0 and text features T_cross0 that fuse text features; Text feature T0 is added to text feature T_cross0 and then processed by the text encoding module to obtain text feature T1. Image feature V0 is added to image feature V_cross0 and then processed by the image encoding module to obtain image feature V1. Add the first short-circuit branch that adds text feature T_cross0 to text feature T1 and the second short-circuit branch that adds image feature V_cross0 to image feature V1; The first and second short-circuit branches each have four branches, which are used to better incorporate shallow small target feature information into the deep network.
2. The method for constructing a network model for small target detection according to claim 1, characterized in that: Text features and image features interact through a cross-modal multi-head attention mechanism, and the extracted text features include guidance from the image features.
3. A method for constructing a network model for small target detection according to claim 1 or 2, characterized in that: In the multimodal fusion module of the network model, a branch is added to add shallow features to deep features, thus forming a multimodal deep fusion module. This enables the network to better learn the key features of small targets, and the key feature information is retained in the deep structure.
4. The application of a network model for small object detection as described in any one of claims 1 to 3 in tennis ball detection.
5. The application according to claim 4, characterized in that: Step 1. Read tennis ball images in real time; Step 2. Input the tennis ball image into the network model to perform the forward inference process; Step 3. The network determines whether there is a tennis ball in the image. If there is a tennis ball, proceed to step 4; otherwise, proceed to step 5; Step 4. Label the tennis ball and indicate that a tennis ball has been detected in the image; Step 5. Continue to input unread images into the network model. If there are unread images, return to step 1; otherwise, end the detection.