Method, system, and medium for detecting a shooter in a fixed-view basketball event video

Shooting segments are extracted with a local optical flow method and processed by a trained target detector. Temporal information and local motion features are then used to locate the shot and identify the shooter, addressing the accuracy and real-time limitations of shooter detection in basketball video.

CN119091502B · Active · Publication Date: 2026-06-12 · CENT SOUTH UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CENT SOUTH UNIV
Filing Date
2024-08-16
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing basketball shot detection methods lack accuracy and real-time performance in complex dynamic scenes. Traditional feature detection algorithms are easily affected by lighting and background interference. Deep learning models struggle to balance accuracy and real-time performance, and multi-level information fusion is insufficient.

Method used

The local optical flow method is used to extract shooting segments, which are then processed by a trained target detector. Redundant frames are removed by frame skipping to accelerate processing. The shooting moment is located using temporal information and local motion features. The best shooter is selected by combining confidence and intersection over union (IoU).

Benefits of technology

It improves the accuracy and real-time performance of shooter detection, enabling stable operation in complex scenarios and achieving high-accuracy real-time detection.

✦ Generated by Eureka AI based on patent content.

Figure CN119091502B_ABST

Abstract

The application discloses a fixed-view-angle basketball game video shooting person detection method and system, and a medium, wherein the method comprises the following steps: acquiring a basketball game video, and intercepting a shooting segment from the basketball game video by using a local optical flow method; performing target detection on the intercepted shooting segment by using a trained target detector; removing redundant video frames by using frame skipping to accelerate target detector reasoning; locating a shooting moment based on a target detector reasoning result and a local motion feature; finding a shooting person from the shooting moment by using time sequence information to traverse a result array in a reverse order; finding a shooting person with the highest confidence based on the confidence; finding a shooting person with the highest quality based on an intersection over union; and finally obtaining a shooting person with the highest confidence and the best quality. The application uses time sequence local motion features to perform rule judgment to realize high-accuracy shooting person detection. The local optical flow method is used to intercept a shooting segment, and frame skipping is used to realize target detector reasoning acceleration, so that real-time detection effect is achieved.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to a method, system, and medium for detecting shooters in basketball game videos from a fixed perspective.

Background Technology

[0002] Real-time analysis and intelligent evaluation of basketball games have become an important direction in modern sports technology research. The widespread application of video surveillance technology in sports events has made it possible to automatically detect and identify player movements through video analytics. Especially for the detection of shooting motions, this not only provides valuable technical feedback to coaches and players but also enhances the viewing experience for spectators and the professionalism of media broadcasts.

[0003] Traditional basketball shot detection methods mainly employ feature-based detection algorithms, which, while meeting requirements to some extent, often suffer from the following problems:

[0004] 1. Limitations of feature detection algorithms: Traditional feature-based detection algorithms (such as edge detection and motion trajectory analysis) are easily affected by factors such as video resolution, lighting conditions, and background interference, resulting in poor detection accuracy and robustness.
2. Complexity of dynamic scenes: In basketball game scenes, the rapid movement and complex changes in players' actions require detection algorithms to have strong dynamic adaptability.

[0005] In recent years, with the development of deep learning technology, convolutional neural networks (CNNs) have performed exceptionally well in image and video analysis. In particular, deep learning-based object detection algorithms, such as YOLO (You Only Look Once) and Faster R-CNN, have achieved remarkable results in various visual tasks. However, directly applying these algorithms to detect people shooting basketballs still faces some challenges:

[0006] 1. High efficiency requirement: Real-time processing of game videos places high demands on the speed of the algorithm. Traditional deep learning models often struggle to achieve high accuracy while maintaining real-time performance.
2. Multi-level information fusion: Shot detection not only needs to focus on the player's action characteristics, but also needs to combine scene context information, such as the position of the basketball hoop and the spatial relationship between the player and the basketball hoop.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, the present invention aims to provide a method, system, and medium for detecting shooters in fixed-view basketball game videos, which significantly improves the accuracy and real-time performance of shooter detection, can work stably in complex dynamic scenes, and provides strong technical support for basketball video analysis.

[0008] Firstly, a method for detecting shooters in fixed-view basketball game videos is provided, including the following steps:

[0009] Acquire basketball game videos and extract shooting segments from them using local optical flow.

[0010] The trained target detector is used to detect targets in the captured shooting segments;

[0011] Accelerate target detector inference by removing redundant video frames through frame skipping;

[0012] Based on the inference results of the target detector and the local motion features, the shooting moment is located. Starting from the shooting moment, the result array is traversed in reverse order using the temporal information to find the shooter. The shooter with the highest confidence is found based on the confidence level, and the shooter with the highest quality is found based on the intersection over union (IoU). Finally, the shooter with the highest confidence and the best quality is obtained.

[0013] Furthermore, the process of extracting shooting segments from a basketball game video using the local optical flow method includes:

[0014] The basketball game video is read frame by frame, and the pre-defined basketball hoop area in the video frame is converted into a grayscale image;

[0015] The Farneback optical flow algorithm is used to calculate the pixel optical flow vector between two adjacent grayscale images. The magnitude of the average optical flow vector in the basket area is calculated to determine whether the net is swinging.

[0016] If the magnitude of the average optical flow vector is greater than the magnitude threshold, net swing is considered to have occurred, and the moment of the net swing is taken as the moment the shot is made. A shooting segment is obtained by extracting a preset number of video frames before and after that moment, and the frame number of the made-shot moment within the shooting segment is saved.

[0017] Furthermore, the target detector is trained using the following method:

[0018] Collect basketball game videos and annotate each frame in the videos to obtain a training sample dataset;

[0019] The YOLOv9-based object detector was trained using the training dataset to obtain the final object detector.

[0020] Furthermore, the training sample dataset is obtained through the following method:

[0021] Collect basketball game videos, remove video frames that do not contain players, and use the number of players to remove invalid video frames;

[0022] The locations and categories of targets in the remaining video frames are labeled, including basketballs, backboards, rims, players, and shooters;

[0023] Data augmentation is performed on the labeled video frames, including random cropping, rotation, scaling, and color transformation, to obtain the training sample dataset.

[0024] Furthermore, the step of accelerating target detector inference by removing redundant video frames through frame skipping includes:

[0025] Read the frame number of the successful-shot moment in the shooting segment, step back a preset number of frames from the video frame at that moment, and select the video frames from there forward, up to the video frame at the successful-shot moment, as the frames to be detected. Then input the selected video frames into the target detector for target detection.

[0026] Furthermore, the step of locating the shooting moment based on the inference results of the target detector combined with local motion features specifically includes:

[0027] The target detection results for each frame are stored in the result array X. The contact area between the basketball and the hoop is calculated to determine whether there is contact between them. When there is contact, it is considered a shooting moment, and the video frame at the shooting moment and the basketball's position information P_b relative to the hoop are recorded.

[0028] Furthermore, the shooter with the highest confidence and best quality is obtained through the following process:

[0029] The result array X is traversed frame by frame in reverse order from the moment of the shot;

[0030] When a video frame containing a shooter is encountered, the shooter with the highest confidence is selected from that frame, and the positional information P_s of that shooter relative to the basket is calculated. It is then determined whether P_s is the same as P_b, the positional information of the basketball relative to the basket at the shooting moment. If they are not the same, the reverse traversal of the video frames continues; if they are the same, it is determined whether the shooter's confidence is greater than the set shooter confidence threshold. If so, the shooter is considered valid and the video frame is recorded; otherwise, the reverse traversal continues.

[0031] Starting from the video frame containing the valid shooter, iterate backwards through a preset number of frames to find the video frame in which the shooter's confidence is greater than the shooter confidence threshold and the shooter's intersection over union (IoU) with the surrounding players is the lowest. This yields the shooter with the highest confidence and best quality.
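The selection rule in [0031] can be sketched as follows. This is only an illustration under stated assumptions: the text's "crossover ratio" is read as intersection over union (IoU), "lowest" is interpreted as the smallest maximum IoU against any surrounding player, and the per-frame dictionary layout and the function name best_quality_frame are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left and bottom-right corners


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_quality_frame(frames: List[Dict], start: int, window: int,
                       conf_threshold: float) -> Optional[int]:
    """Walk backwards from `start` over at most `window` frames.

    Assumed (hypothetical) frame layout:
        {"shooter": (box, confidence) or None, "players": [box, ...]}
    Returns the index of the frame whose shooter passes the confidence
    threshold and overlaps the surrounding players the least.
    """
    best_idx, best_overlap = None, float("inf")
    for i in range(start, max(start - window, -1), -1):
        shooter = frames[i]["shooter"]
        if shooter is None:
            continue
        box, conf = shooter
        if conf <= conf_threshold:
            continue
        # Assumption: "quality" is the worst-case overlap with any player.
        overlap = max((iou(box, p) for p in frames[i]["players"]), default=0.0)
        if overlap < best_overlap:
            best_idx, best_overlap = i, overlap
    return best_idx
```

A low overlap suggests the shooter's box is not occluded by other players, which is presumably why the text treats it as the quality measure.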

[0032] Furthermore, in the process of searching for the shooter by traversing the result array in reverse order from the shooting moment using the temporal information, if the reverse traversal finds no video frame in which the shooter's confidence is greater than the shooter confidence threshold, i.e., no shooter is found, the shooter confidence threshold is reduced according to the set attenuation coefficient, and the reverse traversal is performed again to search for the shooter. This process is iterated until the shooter is found or the shooter confidence threshold falls below the minimum shooter confidence threshold.
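The retry loop in [0032] might look like the sketch below. The text does not say how the attenuation coefficient is applied, so multiplying the threshold by a factor in (0, 1) is an assumption, and reverse_search and find_with_decay are hypothetical names that stand in for the full reverse traversal.

```python
from typing import Callable, List, Optional, Tuple


def reverse_search(confidences: List[Optional[float]], threshold: float) -> Optional[int]:
    """Simplified stand-in for the reverse traversal: return the latest frame
    index whose shooter confidence exceeds `threshold`, or None."""
    for i in range(len(confidences) - 1, -1, -1):
        conf = confidences[i]
        if conf is not None and conf > threshold:
            return i
    return None


def find_with_decay(search: Callable[[float], Optional[int]],
                    initial_threshold: float,
                    decay: float,
                    min_threshold: float) -> Tuple[Optional[int], float]:
    """Repeat `search` with a shrinking threshold until a shooter is found
    or the threshold drops below `min_threshold`.

    Assumption: the attenuation is multiplicative (threshold *= decay).
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be in (0, 1) so the loop terminates")
    if min_threshold <= 0.0:
        raise ValueError("min_threshold must be positive so the loop terminates")
    threshold = initial_threshold
    while threshold >= min_threshold:
        found = search(threshold)
        if found is not None:
            return found, threshold
        threshold *= decay
    return None, threshold
```

For example, `find_with_decay(lambda t: reverse_search(confs, t), 0.8, 0.5, 0.1)` retries at 0.8, 0.4, 0.2, and so on until a frame qualifies.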

[0033] Secondly, a system for detecting shooters in fixed-view basketball game videos is provided, including:

[0034] A memory on which computer programs are stored;

[0035] A processor is used to load and execute the computer program to implement the method for detecting shooters in fixed-view basketball game videos as described above.

[0036] Thirdly, a computer-readable storage medium is provided, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method for detecting shooters in fixed-view basketball game videos as described above.

[0037] This invention proposes a method, system, and medium for detecting shooters in basketball game videos from a fixed perspective, which has the following beneficial effects:

[0038] 1. In order to solve the problem of finding the shooter in a complex basketball scene, this invention uses a target detector, trains the target detector with pre-collected basketball game video data, and uses the target detector to automatically detect the shooter.

[0039] 2. To address the issues of low accuracy and slow processing speed in basketball shooter recognition, this invention uses temporal local motion features for rule-based judgment to achieve high accuracy in basketball shooter detection; it utilizes local optical flow to extract shooting segments and employs frame skipping and early termination to accelerate the inference of the target detector, thereby achieving real-time detection.

Attached Figure Description

[0040] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0041] Figure 1 is a flowchart of the method for detecting shooters in fixed-view basketball game videos provided in an embodiment of the present invention.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be described in detail below. Obviously, the described embodiments are merely some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other implementation methods obtained by those skilled in the art without creative effort are within the scope of protection of this invention.

[0043] To improve the accuracy of shooter recognition in basketball videos and achieve real-time shooter identification, as shown in Figure 1, this invention proposes a method for detecting shooters in fixed-view basketball game videos based on programmable gradient information and a generalized efficient layer aggregation network, comprising the following steps:

[0044] S1: Training data acquisition.

[0045] The video data used in this embodiment were all captured by a 1080P high-definition camera at a fixed position. In order to train a high-precision basketball player recognition model and obtain basketball video data based on user characteristics and needs, the 1080P high-definition camera was deployed to a fixed position on a court with professional basketball game qualifications after being designated by professionals.

[0046] Basketball videos are collected, and professional annotators use X-AnyLabeling to label the basketball, rim, backboard, players, and shooters, using the category and location information as label data. The labeling criteria are as follows. Video frames without any players are discarded, because the model does not need to learn the background information of a fixed scene. To further improve training speed and effectiveness, in some preferred embodiments, video frames with fewer than five players are discarded, but frames containing a shooter can be selected directly. A shooter is a type of player with more specific characteristics; the selected video frame should show the instant of the shot, when the arm is raised to or above head height. The label box should fit the target closely in both size and position. A target should still be labeled even if it is occluded, and small targets also need to be labeled; targets that are too blurry and lack features do not need to be labeled.

[0047] S2: Target detector training.

[0048] After acquiring sufficient and abundant basketball video data, the video data is divided and labeled according to a specified frame rate set by professionals. This results in a sufficient number of high-definition basketball video frames with high discrimination, effectively avoiding problems such as poor model recognition ability caused by imbalanced class distribution in the dataset during deep learning model training. The completed basketball training dataset is used to train the target detector, thereby obtaining a high-precision, fast-inference-speed model for recognizing shooters in basketball videos, which can be directly applied to subsequent shooter detection processes. The specific steps are as follows:

[0049] S21: Prepare training data, including images and corresponding labels, usually bounding boxes and their categories. The common format is COCO format. Perform data augmentation on the images, including random cropping, rotation, scaling, color transformation, etc., to increase the diversity of the data and improve the robustness of the model.

[0050] S22: Scale the image to a fixed size to fit the network's input requirements, normalize pixel values to the [0,1] range, convert the bounding boxes to relative coordinates, and convert the class labels to one-hot encoding.
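The label conversions in S22 can be illustrated with a small sketch. Several details are assumptions not given in the text: the (center x, center y, width, height) layout follows the common YOLO convention, the class order in CLASSES is arbitrary, and all function names are hypothetical.

```python
from typing import List, Tuple

# Categories named in the labeling step; the ordering here is an assumption.
CLASSES = ["basketball", "backboard", "rim", "player", "shooter"]


def normalize_pixels(row: List[int]) -> List[float]:
    """Map 8-bit pixel values to the [0, 1] range."""
    return [v / 255 for v in row]


def to_relative_box(box: Tuple[float, float, float, float],
                    img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert pixel corners (x1, y1, x2, y2) to relative (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2 / img_w,
            (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w,
            (y2 - y1) / img_h)


def one_hot(class_name: str) -> List[int]:
    """One-hot vector for a class label."""
    vec = [0] * len(CLASSES)
    vec[CLASSES.index(class_name)] = 1
    return vec
```

Relative coordinates keep the labels valid after the image is rescaled to the network input size, which is why the conversion is done alongside the resize.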

[0051] S23: The object detector is built on YOLOv9. Input image data is fed into a GELAN network to extract multi-scale features. To alleviate the information bottleneck caused by increasing model depth, programmable gradient information is added during feature propagation. The main branch is used for inference, while auxiliary invertible branches address issues arising from neural network depth. Multi-level auxiliary information handles error accumulation caused by deep supervision. Programmable gradient information prevents feature loss due to excessive network depth. A generalized, efficient layer aggregation network is used to achieve model lightweighting, improving inference speed and detection accuracy. Bounding box regression and class prediction are performed based on feature maps, outputting feature maps at multiple scales, each corresponding to a target of different sizes.

[0052] S24: Calculate the position regression loss and size regression loss, use cross-entropy loss to calculate the classification loss of the target category, and calculate the confidence loss to distinguish between foreground and background;

[0053] S25: Calculate the classification loss using BCE Loss, the regression loss using DFL Loss and CIoU Loss, update the model parameters using the Adam optimizer, update the model parameters once for each batch of data, and then use the learning rate scheduler to adjust the learning rate according to the training progress to ensure model convergence and stability.

[0054] S3: Acquire basketball game videos and extract shooting segments from the videos using local optical flow.

[0055] Local optical flow is based on the assumption that the brightness value of the same pixel in adjacent frames remains unchanged over a short period of time in an image sequence, i.e., the brightness consistency assumption. It estimates the pixel's motion vector by measuring the brightness changes within local regions, specifically as follows:

[0056] $f(x) = x^{T} A x + b^{T} x + c$

[0057] $g(x) = f(x) + \nabla f(x) \cdot d$

[0058] $E(d) = \int \left[ g(x) - f(x + d) \right]^{2} \, dx$

[0059] In the formula, f(x) represents the pixel brightness in the image at time t, and x is the coordinate vector (x, y) in two-dimensional space. A is a symmetric matrix that describes the second derivative information of the image in this region, b is a vector that represents the first derivative of the image, c is a constant that represents the average brightness of the region, g(x) represents the pixel brightness in the image at time t+Δt, ∇f(x) is the gradient of the image f(x), d is the motion vector, i.e., optical flow, and E(d) represents the error function. The optical flow d can be obtained by minimizing the error function.

[0060] The game video is read frame by frame, and the region of interest (ROI) in each frame is converted into a grayscale image. Here, the ROI refers to the position of the basketball hoop. The Farneback optical flow algorithm is used to calculate the pixel optical flow vectors between two adjacent grayscale images. The magnitude of the average optical flow vector in the basketball hoop region is calculated to determine whether the net is swaying. If the magnitude of the average optical flow vector is greater than a threshold, net sway is considered to have occurred, and the moment of net sway is taken as the moment the shot is made. A predetermined number of video frames before and after that moment are then extracted to obtain a shooting segment. In this embodiment, 125 frames before and 125 frames after the moment of the shot are selected to obtain a shooting segment, and the frame number of the moment of the shot is saved to a txt file.
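The thresholding and clipping logic of [0060] can be sketched as follows. The sketch assumes the per-frame mean flow magnitudes for the hoop ROI have already been computed (in practice possibly with OpenCV's cv2.calcOpticalFlowFarneback, which is not called here). The function name, the dictionary keys, and the choice to skip to the end of a segment so that one net swing is not counted twice are assumptions, not details from the text.

```python
from typing import Dict, List


def find_shot_segments(mean_magnitudes: List[float], threshold: float,
                       before: int = 125, after: int = 125) -> List[Dict[str, int]]:
    """Locate made-shot moments and the frame window around each one.

    mean_magnitudes[i] is assumed to be the mean optical flow magnitude in
    the hoop region for frame i (relative to the previous frame).
    """
    segments = []
    n = len(mean_magnitudes)
    i = 0
    while i < n:
        if mean_magnitudes[i] > threshold:
            start = max(0, i - before)
            end = min(n - 1, i + after)
            segments.append({
                "start": start,
                "end": end,
                # Frame number of the made shot inside the clipped segment,
                # i.e. the value the embodiment saves to a txt file.
                "shot_frame_in_segment": i - start,
            })
            # Assumption: jump past this segment so the same net swing,
            # which lasts several frames, yields a single segment.
            i = end + 1
        else:
            i += 1
    return segments
```

With the embodiment's defaults of 125 frames on each side, a shot detected well inside the video yields a 251-frame segment with the shot at index 125.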

[0061] S4: Detect targets in the captured shooting segments using a trained target detector.

[0062] The object detector is built on YOLOv9 and features programmable gradient information and a generalized efficient layer aggregation network. The programmable gradient information mainly consists of three components: the main branch, the auxiliary invertible branch, and multi-level auxiliary information. The inference process of the programmable gradient information only uses the main branch, thus requiring no additional inference cost. The auxiliary invertible branch is designed to address the problems caused by deepening the neural network, which leads to information bottlenecks, preventing the loss function from generating reliable gradients. The multi-level auxiliary information addresses the error accumulation problem caused by deep supervision and is particularly suitable for architectures with multiple prediction branches and lightweight models. The generalized efficient layer aggregation network combines two neural network architectures, CSPNet and ELAN, which employ gradient path planning, resulting in a neural network that balances lightweight design, inference speed, and accuracy.

[0063] More specifically, the object detector architecture is as follows:

[0064] The object detector mainly consists of three parts: a Generalized ELAN (GELAN) as the backbone network, a Feature Pyramid Network (FPN) as the neck framework, and Programmable Gradient Information (PGI) as the information augmentation module. In the object detection algorithm, the backbone primarily determines the feature representation capability, and its design has a crucial impact on inference efficiency because it bears a significant computational cost. The neck is used to aggregate low-level physical features with high-level semantic features, and then build pyramid feature maps at each level. The information augmentation module addresses the problem of slow convergence speed or poor convergence performance in deep neural networks.

[0065] GELAN Backbone:

[0066] GELAN is a novel lightweight, fast, and accurate network architecture composed of CSPNet with gradient path planning and ELAN. CSPNet is a deep learning network architecture primarily used to improve the learning ability and efficiency of neural networks. CSPNet solves the problem of redundant computation in networks by using cross-stage partial connections, improving network accuracy and speed while maintaining computational cost. The core idea of CSPNet is to divide the feature map into two parts: one part is directly passed to the next stage, while the other part undergoes further convolution operations before being merged with the directly passed part. This reduces redundant computation, improves efficiency, and maintains feature diversity, thereby improving network performance. The main structure of CSPNet includes splitting and merging, cross-stage connections, and partial stacking. The ELAN network module aims to improve the feature extraction capability and computational efficiency of neural networks. The ELAN module improves model performance while maintaining a lightweight network structure through efficient layer aggregation methods. The core idea of the ELAN module is to aggregate features from different levels to fully utilize multi-scale information, thereby improving feature representation capabilities. ELAN achieves efficient feature fusion through multiple parallel convolution and aggregation operations. The basic structure of the ELAN module includes multi-branch parallel convolution, layer aggregation, pointwise convolution, and skip connections.

[0067] FPN neck network:

[0068] The Feature Pyramid Network module is designed to efficiently fuse feature maps from different scales, thereby improving the accuracy and robustness of object detection. The Feature Pyramid Network captures information at different scales in an image by extracting feature maps from different convolutional layers. These feature maps include high-resolution shallow features and low-resolution deep features. Shallow features contain more locational information and details, while deep features contain more semantic information. To achieve multi-scale feature fusion, the Feature Pyramid Network module includes upsampling and downsampling operations. Upsampling upscales low-resolution deep features to high resolution, aligning them with shallow features, while downsampling reduces the size of high-resolution feature maps to align them with low-resolution feature maps. Skip connections are a key component of the Feature Pyramid Network module; they directly pass feature maps between different layers, ensuring effective fusion of information at different scales, and can fuse high-resolution feature maps from shallow layers with upsampled deep feature maps. Feature fusion combines feature maps from different sources through concatenation or weighted summation. The fused feature map contains both detailed shallow features and semantically rich deep features. This fusion method enables the network to better handle targets at various scales, improving detection accuracy and robustness. The feature pyramid network module typically outputs feature maps at multiple scales, which are then fed into the detection head for target classification and bounding box regression.

[0069] PGI Information Enhancement Module:

[0070] Programmable Gradient Information (PGI) aims to address the information loss problem encountered during the training of deep neural networks. PGI enhances the training of the backbone network by providing additional gradient information, thereby improving the model's performance and accuracy. PGI consists of a backbone network, auxiliary reversible branches, and multi-level auxiliary supervision. The backbone network of PGI is the core network performing the object detection task; this backbone network is GELAN. The auxiliary reversible branches are a crucial part of PGI, working in parallel with the backbone network. They can fuse shallow features into deep features through reversible operations, alleviating the information bottleneck problem. PGI introduces auxiliary supervision signals at different levels, such as feature pyramid network loss and path aggregation network loss. These auxiliary loss functions provide additional gradient information, helping the model better learn multi-scale features. In PGI, data propagates forward through both the backbone network and the auxiliary invertible branch, generating detection results and auxiliary supervision signals. The main loss function and the auxiliary loss function are calculated, and the gradient of the total loss function is calculated through backpropagation. The gradient of the backbone network comes from the main loss function, while the gradient of the auxiliary invertible branch comes from the auxiliary loss function. The gradient of the auxiliary invertible branch is passed to the backbone network through a reversible operation to achieve gradient fusion and enhancement. The parameters of the backbone network and the auxiliary invertible branch are updated using the optimizer to complete one iteration.

[0071] Based on this, the target detection process of the target detector is as follows:

[0072] S41: Obtain the shooting clip, read the video frame data, and scale the video frame data to a fixed size (e.g., 640x640);

[0073] S42: Input the image data into the feature extraction network GELAN and downsample by factors of 8, 16, and 32 to obtain features H_m suitable for detecting small, medium, and large targets, respectively;

[0074] S43: Feed the downsampled features H_m into a feature pyramid for multi-scale feature fusion, constructing feature maps of different resolutions so that the model can detect targets of different sizes simultaneously. Fusing high-resolution features from the low-level layers with strong semantic features from the high-level layers improves the detection accuracy of small targets and yields the fused features H′_m;

[0075] S44: Feed the fused features H′_m into the regression layer to obtain the position of each bounding box center relative to its grid cell, the size of the bounding box, and the probability that a target object exists in the predicted box. The high-level features of the feature map are then input to the fully connected layer to obtain the category classification.

[0076] S5: Accelerate target detector inference by removing redundant video frames through frame skipping.

[0077] In a basketball shooting clip, extra video frames are introduced during and after the shot for better viewing and to ensure the video's integrity and smoothness. These extra video frames do not help in identifying the shooter and instead increase the model's time overhead. Therefore, video frames that are not useful for identifying the shooter can be skipped. When the shooter is detected, the model inference is stopped in advance to avoid the detection and inference of subsequent useless frames, thereby achieving real-time detection.

[0078] The txt file recording the successful-shot moment is read to obtain the frame number of the successful-shot moment in the current shooting segment, denoted F1. Statistical analysis of existing shooting segment test data shows that the time from the shooter's release to the basketball contacting the backboard is 1 to 2 seconds. The maximum value is taken and converted into a number of frames; in this embodiment, the result is 50 frames. The starting frame after frame skipping can therefore be expressed as F = F1 - 50. That is, the 50 video frames preceding the successful-shot frame are selected, and the selected video frames are input into the target detector in forward order for target detection.
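The frame-skipping arithmetic can be written out as a short sketch. Two points are assumptions: a 2-second maximum maps to 50 frames only at 25 frames per second, which is inferred rather than stated in the text, and the successful-shot frame itself is included in the selection. The function name frames_to_detect is hypothetical.

```python
from typing import List


def frames_to_detect(shot_frame: int, max_flight_seconds: float = 2.0,
                     fps: int = 25) -> List[int]:
    """Return the frame indices passed to the detector, in forward order.

    Implements F = F1 - lookback, where lookback = max_flight_seconds * fps
    (50 frames under the assumed 25 fps), clamped at the segment start.
    """
    lookback = int(round(max_flight_seconds * fps))
    first = max(0, shot_frame - lookback)
    # Assumption: the successful-shot frame F1 is included as the last frame.
    return list(range(first, shot_frame + 1))
```

For a 251-frame segment with the shot at frame 125, this passes 51 frames (75 through 125) to the detector instead of all 251, roughly a fivefold reduction in inference work.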

[0079] S6: Based on the object detector's inference results and local motion features, the shooting moment is located. Starting from the shooting moment, the result array is traversed in reverse order using temporal information to find the shooter. The shooter with the highest confidence is found, and the shooter with the highest quality is found using the intersection over union (IoU). Finally, the shooter with the highest confidence and best quality is obtained. Once a shooter is detected, a further 8 frames are traversed to find the shooter with the highest confidence and best quality, at which point the detection flag is set to true. The program stops early when the detection flag is true.

[0080] S61: Store the target detection results of each frame in the result array X. Determine whether the basketball and the hoop are in contact by calculating the contact area. When the basketball and the hoop are in contact, that moment is considered the shooting moment. Record the video frame of the shooting moment and the basketball's position information P_b relative to the hoop. Since the video shooting angle is fixed, multiple areas can be obtained by pre-defining a dividing line based on the basketball hoop, with the area where the basketball is located serving as the position information; for example, if a line divides the space around the basketball hoop into left and right areas, whether the basketball lies in the left or right area serves as the position information. Specifically, contact is determined as follows:

[0081] $L = \big(\max(x_{1A}, x_{1B}) \le \min(x_{2A}, x_{2B})\big) \land \big(\max(y_{1A}, y_{1B}) \le \min(y_{2A}, y_{2B})\big)$

[0082] Here L is the result of whether the basketball is in contact with the basket, and A and B are the bounding boxes of the basketball and the basket respectively. Each bounding box is represented by the coordinates of its upper left corner and lower right corner: A: $(x_{1A}, y_{1A}, x_{2A}, y_{2A})$, B: $(x_{1B}, y_{1B}, x_{2B}, y_{2B})$.
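The contact condition L translates directly into code. The sketch below assumes boxes are given as (x1, y1, x2, y2) tuples in image coordinates; the function name and the sample coordinates are illustrative only.

```python
from typing import Tuple

# (x1, y1, x2, y2): upper left and lower right corners
Box = Tuple[float, float, float, float]


def boxes_in_contact(a: Box, b: Box) -> bool:
    """Return L: True when boxes A and B overlap or touch on both axes."""
    x1a, y1a, x2a, y2a = a
    x1b, y1b, x2b, y2b = b
    overlap_x = max(x1a, x1b) <= min(x2a, x2b)
    overlap_y = max(y1a, y1b) <= min(y2a, y2b)
    return overlap_x and overlap_y


if __name__ == "__main__":
    ball = (100.0, 50.0, 120.0, 70.0)   # hypothetical basketball box
    hoop = (110.0, 65.0, 160.0, 80.0)   # hypothetical hoop box
    print(boxes_in_contact(ball, hoop))
```

Because the comparison uses ≤, boxes that share only an edge also count as being in contact.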

[0083] S62: When the basketball and the hoop are in contact, it is assumed that a shooting action may occur. The result array X is traversed frame by frame in reverse order from the moment the shooting action occurs. Before traversing frame by frame in reverse order, a maximum threshold for reverse traversal is set.

[0084] S63: When a video frame containing a shooter is encountered during traversal, select the shooter with the highest confidence in that frame and calculate that shooter's position information P_s relative to the basket. Determine whether P_s is the same as P_b, the basketball's position information relative to the basket at the shooting moment. Requiring the shooter and the basketball to be in the same position avoids interference from other players shooting off the court. If the two are not the same, continue to traverse the video frames in reverse order. If they are the same, determine whether the shooter's confidence is greater than the set shooter confidence threshold. If so, the shooter is considered a valid shooter and the video frame is recorded; otherwise, continue to traverse the video frames in reverse order.
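Steps S62 and S63 can be sketched as a bounded reverse loop. Everything concrete here is an assumption made for illustration: the result array is modelled as a list of per-frame lists of dictionaries with "label", "conf", and "box" keys, the dividing line is vertical, and a box is assigned to the left or right area by its horizontal center.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def side_of_hoop(box: Box, divider_x: float) -> str:
    """Position information: which side of the dividing line the box center is on."""
    center_x = (box[0] + box[2]) / 2.0
    return "left" if center_x < divider_x else "right"


def find_valid_shooter_frame(
    results: List[List[Dict]],
    shot_idx: int,
    ball_side: str,
    divider_x: float,
    conf_threshold: float,
    max_back: int,
) -> Optional[int]:
    """Walk backward from the shooting moment; return the first valid shooter frame."""
    # Look at the shooting frame plus at most max_back earlier frames.
    stop = max(-1, shot_idx - max_back - 1)
    for idx in range(shot_idx, stop, -1):
        shooters = [d for d in results[idx] if d["label"] == "shooter"]
        if not shooters:
            continue
        best = max(shooters, key=lambda d: d["conf"])
        # P_s must match P_b, otherwise keep traversing.
        if side_of_hoop(best["box"], divider_x) != ball_side:
            continue
        if best["conf"] > conf_threshold:
            return idx
    return None
```

Returning None signals that no valid shooter was found within the traversal limit, which is the case the threshold reduction in S65 handles.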

[0085] S64: Starting from the video frame containing a valid shooter, continue traversing backward for a preset number of frames (e.g., 8 consecutive frames in reverse order) to find the video frame in which the shooter's confidence is greater than the shooter confidence threshold and the shooter's intersection over union (IoU) with the surrounding players is lowest. This yields the shooter with the highest confidence and best quality. The lower the IoU between the shooter and the surrounding players, the less foreground (here, other players) appears in the shooter's image, and the higher the shooter's quality is considered to be.
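A minimal sketch of the quality selection in S64 follows. The text does not say how overlap with several surrounding players is combined into one value; taking the maximum IoU over all players in the frame is an assumption, as are the detection dictionary layout and the function names.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_quality_frame(
    results: List[List[Dict]],
    start_idx: int,
    conf_threshold: float,
    window: int = 8,
) -> Optional[Tuple[int, float]]:
    """Among `window` frames ending at start_idx, pick the least-occluded shooter."""
    best: Optional[Tuple[int, float]] = None
    stop = max(-1, start_idx - window)
    for idx in range(start_idx, stop, -1):
        shooters = [
            d for d in results[idx]
            if d["label"] == "shooter" and d["conf"] > conf_threshold
        ]
        if not shooters:
            continue
        shooter = max(shooters, key=lambda d: d["conf"])
        players = [d for d in results[idx] if d["label"] == "player"]
        # Assumption: occlusion is the largest IoU with any surrounding player.
        overlap = max((iou(shooter["box"], p["box"]) for p in players), default=0.0)
        if best is None or overlap < best[1]:
            best = (idx, overlap)
    return best
```

The strict comparison keeps the frame closest to the shooting moment when two frames have equal overlap.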

[0086] S65: While the result array is traversed in reverse order from the shooting moment to search for the shooter, if no video frame is found in which the shooter's confidence is greater than the shooter confidence threshold, that is, no shooter is found, the shooter confidence threshold is reduced according to the set attenuation coefficient and steps S62-S64 are repeated to search for the shooter. This process is repeated until the shooter is found or the shooter confidence threshold falls below the minimum shooter confidence threshold.
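The retry logic of S65 can be sketched as a loop around any search routine. The text only says the threshold is reduced "according to the set attenuation coefficient"; multiplying by a coefficient between 0 and 1 is an assumption, and `search` is a hypothetical callable standing in for steps S62-S64.

```python
from typing import Callable, Optional, Tuple


def search_with_decay(
    search: Callable[[float], Optional[int]],
    initial_threshold: float,
    min_threshold: float,
    decay: float,
) -> Tuple[Optional[int], float]:
    """Repeat `search` with a shrinking confidence threshold.

    Returns (frame index or None, threshold in effect when the loop ended).
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    threshold = initial_threshold
    while threshold >= min_threshold:
        found = search(threshold)
        if found is not None:
            return found, threshold
        # Assumption: multiplicative attenuation of the threshold.
        threshold *= decay
    return None, threshold


if __name__ == "__main__":
    # Toy search: a single candidate shooter with confidence 0.5 in frame 7.
    toy = lambda t: 7 if 0.5 > t else None
    print(search_with_decay(toy, initial_threshold=0.8, min_threshold=0.3, decay=0.75))
```

The lower bound keeps the loop from accepting arbitrarily weak detections when no shooter is actually present in the segment.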

[0087] The above embodiments propose a method for detecting shooters in fixed-view basketball game videos, solving the problem of finding shooters in complex basketball scenes. This invention uses a target detector, trains it with pre-collected basketball game video data, and then uses the target detector to automatically detect shooters. It solves the problems of low accuracy and slow processing speed in shooter recognition. This invention uses temporal local motion features for rule judgment to achieve high accuracy in shooter detection. It uses local optical flow to extract shooting segments and uses frame skipping and early termination to accelerate the inference of the target detector, thereby achieving real-time detection.

[0088] Experimental dataset settings:

[0089] To fully verify the effectiveness, recognition accuracy, and real-time performance of the shooting person detection method in fixed-view basketball game videos, this experiment collected high-definition basketball video and image data from multiple stadiums and integrated them into a unified dataset for training the target detector using the processing method described above.

[0090] In this experiment, over 37,000 shooting images were selected from numerous basketball videos as the dataset for annotation. The annotation targets were basketball, hoop, backboard, player, and shooter. The dataset was partitioned in an 8:1:1 ratio, meaning that 80% of the images were randomly assigned to the training set, 10% to the validation set, and 10% to the test set. A single NVIDIA V100 graphics card was used to train the object detection model. Some parameters used in this experiment are shown in Table 1.
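For illustration, an 8:1:1 random partition can be written as below. The seed, the function name, and the use of integer identifiers in place of image files are assumptions; the text does not describe how the split was implemented.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Randomly partition items into train (80%), validation (10%), test (rest)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # 37,000 is a stand-in for the "over 37,000" images mentioned in the text.
    train, val, test = split_8_1_1(range(37000))
    print(len(train), len(val), len(test))
```

Any remainder from integer division falls into the test set, so the three parts always cover the dataset exactly once.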

[0091] Table 1

[0092]

[0093]

[0094] Setting a lower learning rate and applying data augmentation can effectively constrain model training and enhance the model's object detection performance.

[0095] Based on this experimental setup, the present invention trained the model and tested its average accuracy, shooter accuracy, and inference time (processing 10-second shooting segments) using existing video data after training. The test results are shown in Table 2.

[0096] Table 2

[0097]
Evaluation indicator                 Test result
Basketball average accuracy          0.94
Basketball hoop average accuracy     0.995
Player average accuracy              0.992
Shooter average accuracy             0.89
Shooter accuracy                     91.47%
Inference time (GPU)                 1.226 s
Inference time (CPU)                 2.903 s

[0098] The experimental results show good performance in terms of mAP, shooter accuracy, and inference time: a 10-second shooting segment is processed in 1.226 s on the GPU and 2.903 s on the CPU, both shorter than the segment itself. This indicates that the target detection model can identify the shooter effectively and in real time.

[0099] This invention also provides a system for detecting shooters in fixed-view basketball game videos, comprising:

[0100] A memory on which computer programs are stored;

[0101] A processor is used to load and execute the computer program to implement the method for detecting shooters in fixed-view basketball game videos as described above.

[0102] This invention also provides a computer-readable storage medium storing a computer program thereon, characterized in that the computer program, when executed by a processor, implements the method for detecting shooters in fixed-view basketball game videos as described above.

[0103] It is understood that the same or similar parts in the above embodiments can be referred to each other, and the contents not described in detail in some embodiments can be referred to the same or similar contents in other embodiments.

[0104] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0105] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0106] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0107] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0108] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A method for detecting shooters in a fixed-view basketball game video, characterized in that it comprises the following steps: acquiring a basketball game video and extracting shooting segments from it using a local optical flow method; performing target detection on the extracted shooting segments using a trained target detector, wherein the target detector is built on YOLOv9: input image data is fed into the GELAN network to extract multi-scale features, and programmable gradient information is added during feature propagation, in which the main branch is GELAN and is used for the inference process, the auxiliary reversible branch works in parallel with the main branch and fuses shallow features into deep features through reversible operations to solve the problems brought about by the depth of the neural network, and multi-level auxiliary information introduces auxiliary supervision signals at different levels to deal with the error accumulation problem caused by deep supervision; accelerating target detector inference by removing redundant video frames through frame skipping; locating the shooting moment based on the inference results of the target detector combined with local motion features, traversing the result array in reverse order from the shooting moment using temporal information to find the shooter, finding the shooter with the highest confidence based on confidence, finding the shooter with the highest quality based on the intersection over union, and finally obtaining the shooter with the highest confidence and best quality; wherein locating the shooting moment based on the inference results of the target detector combined with local motion features specifically includes: storing the target detection results of each frame in the result array X; calculating the contact area between the basketball and the rim to determine whether they are in contact; and, when they are in contact, treating that moment as the shooting moment and recording the video frame of the shooting moment and the basketball's position information P_b relative to the rim; and wherein the shooter with the highest confidence and best quality is obtained through the following process: traversing the result array X frame by frame in reverse order from the shooting moment; when a video frame containing a shooter is encountered, selecting the shooter with the highest confidence from that frame and calculating that shooter's position information P_s relative to the basket; determining whether the shooter is a valid shooter based on the comparison of P_s with P_b and on the shooter confidence threshold, and if so, recording the video frame, otherwise continuing to traverse the video frames in reverse order; and, starting from the video frame containing a valid shooter, continuing to traverse backward through a preset number of frames to find the video frame in which the shooter's confidence is greater than the shooter confidence threshold and the shooter's intersection over union with the surrounding players is lowest, thereby obtaining the shooter with the highest confidence and best quality.

2. The method for detecting shooters in fixed-view basketball game videos according to claim 1, characterized in that the process of extracting shooting segments from basketball game videos using the local optical flow method includes: reading the basketball game video frame by frame and converting the pre-defined basketball hoop area in each video frame into a grayscale image; using the Farneback optical flow algorithm to calculate the per-pixel optical flow vectors between two adjacent grayscale images; calculating the magnitude of the average optical flow vector in the basketball hoop area to determine whether the net is swinging, wherein if the magnitude of the average optical flow vector is greater than the magnitude threshold, a net swing is considered to have occurred, and the moment at which the net swing occurs is the successful shot moment; and obtaining a shooting segment by extracting a preset number of video frames before and after the successful shot moment, and saving the frame number of the successful shot moment within the shooting segment.

3. The method for detecting shooters in fixed-view basketball game videos according to claim 1, characterized in that, The target detector is trained using the following method: Collect basketball game videos and annotate each frame in the videos to obtain a training sample dataset; The YOLOv9-based object detector was trained using the training dataset to obtain the final object detector.

4. The method for detecting shooters in fixed-view basketball game videos according to claim 3, characterized in that, The training sample dataset was obtained through the following method: Collect basketball game videos, remove video frames that do not contain players, and use the number of players to remove invalid video frames; The locations and categories of targets in the remaining video frames are labeled, including basketballs, backboards, rims, players, and shooters; Data augmentation is performed on the labeled video frames, including random cropping, rotation, scaling, and color transformation, to obtain the training sample dataset.

5. The method for detecting shooters in fixed-view basketball game videos according to claim 1, characterized in that accelerating target detector inference by removing redundant video frames through frame skipping includes: reading the frame number of the successful shot moment in the shooting segment; counting backward a preset number of frames from the video frame of the successful shot moment, and selecting the video frames to be detected in forward order up to the video frame of the successful shot moment; and inputting the selected video frames into the target detector for target detection.

6. The method for detecting shooters in fixed-view basketball game videos according to claim 1, characterized in that the shooter with the highest confidence and best quality is obtained through the following process: traversing the result array X frame by frame in reverse order from the shooting moment; when a video frame containing a shooter is encountered, selecting the shooter with the highest confidence from that frame and calculating that shooter's position information P_s relative to the basket; determining whether the shooter's position information P_s relative to the basket is the same as the basketball's position information P_b relative to the basket at the shooting moment; if they are not the same, continuing to traverse the video frames in reverse order; if they are the same, determining whether the shooter's confidence is greater than the set shooter confidence threshold, and if so, considering the shooter a valid shooter and recording the video frame, otherwise continuing to traverse the video frames in reverse order; and, starting from the video frame containing a valid shooter, continuing to traverse backward through a preset number of frames to find the video frame in which the shooter's confidence is greater than the shooter confidence threshold and the shooter's intersection over union with the surrounding players is lowest, thereby obtaining the shooter with the highest confidence and best quality.

7. The method for detecting shooters in fixed-view basketball game videos according to any one of claims 1 to 6, characterized in that, while the result array is traversed in reverse order from the shooting moment using temporal information to search for the shooter, if no video frame is found in which the shooter's confidence is greater than the shooter confidence threshold, i.e., no shooter is found, the shooter confidence threshold is reduced according to the set attenuation coefficient and the reverse traversal is performed again to search for the shooter; this process is repeated until the shooter is found or the shooter confidence threshold falls below the minimum shooter confidence threshold.

8. A system for detecting shooters in fixed-view basketball game videos, characterized in that it comprises: a memory on which a computer program is stored; and a processor for loading and executing the computer program to implement the method for detecting shooters in fixed-view basketball game videos as described in any one of claims 1 to 7.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, The computer program also implements, when executed by the processor, the method for detecting shooters in fixed-view basketball game videos as described in any one of claims 1 to 7.