A basketball video highlight automatic generation method and system based on multi-modal feature fusion

By employing multimodal feature fusion and cross-modal attention mechanisms, the problem of inaccurate boundary localization of basketball video highlights was solved, enabling efficient automated video highlight generation and improving the viewing quality of basketball game videos.

CN122269082A · Pending · Publication Date: 2026-06-23 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHONGQING UNIV OF POSTS & TELECOMM
Filing Date
2026-03-31
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve deep alignment of audiovisual semantics, resulting in inaccurate boundary positioning of exciting basketball game clips and affecting the viewing quality of video highlights.

Method used

A multimodal feature fusion method is adopted, which uses a bidirectional long short-term memory network and a cross-modal attention mechanism to temporally encode visual and audio features, and combines a one-dimensional convolutional neural network for probability prediction and boundary regression to achieve deep alignment and accurate boundary localization of audiovisual semantics.

Benefits of technology

It significantly improves the accuracy of boundary positioning for highlights, enables fully automated processing from raw video to highlight reels, and improves the efficiency of video content production.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122269082A_ABST
Patent Text Reader

Abstract

The application relates to a method and system for automatically generating basketball video highlights based on multi-modal feature fusion, and belongs to the technical field of computer vision and multimedia intelligent analysis. The method comprises the following steps: S1, extracting and aligning a visual feature sequence and an audio feature sequence; S2, temporally encoding the audio and visual features with a bidirectional LSTM and constructing a cross-modal attention mechanism with audio as the query and vision as the key and value to obtain multi-modal fusion features; S3, inputting the fusion features into a 1D CNN to output a frame-by-frame highlight event probability sequence; S4, performing peak search with an adaptive threshold on the probability sequence to obtain coarse-grained timestamps; S5, extracting local features centered on each coarse-grained timestamp and inputting them into a bidirectional LSTM to predict start and end offsets, thereby obtaining accurate start and end time boundaries; and S6, clipping and splicing the video according to the accurate boundaries to generate a highlight video. The application achieves deep audio-visual semantic alignment through an audio-guided visual attention mechanism, and improves boundary localization accuracy through a coarse-to-fine two-stage localization architecture.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and multimedia intelligent analysis technology, and relates to a method and system for automatically generating basketball video highlights based on multimodal feature fusion.

Background Technology

[0002] With the widespread dissemination of sporting events and the explosive growth of video data, how to automatically and accurately extract and locate highlights from lengthy match videos has become a crucial issue in the field of sports video analytics. Traditional methods relying on manual editing are not only inefficient and costly, but also highly subjective and lack timeliness.

[0003] To address this issue, deep learning-based automatic video analysis techniques have emerged. Existing technologies have attempted to extract visual features from videos using convolutional neural networks (such as ResNet) and combine them with temporal models (such as LSTM) to model actions. However, these single-modal methods often struggle to comprehensively capture all information about key events; for example, crucial audio signals such as audience cheers and referee whistles are frequently overlooked, resulting in low recall rates for locating key events.

[0004] To address this, some studies have begun to introduce multimodal information fusion to improve detection performance. For example, Chinese patent CN114519809A discloses an audiovisual video parsing method that captures multi-scale semantics through a cross-modal temporal convolutional attention network, aiming to identify and locate events in audio and video. Another example is Chinese patent CN115035441A, which proposes a highlight video recognition method that addresses modal feature misalignment by concatenating the encoded audio and video features and applying a self-attention mechanism. However, when performing multimodal fusion, most of these methods employ symmetric bidirectional interaction or simple concatenation. They fail to exploit the sparse, bursty nature of audio signals (such as cheers) and the temporal causal relationship with visual events, resulting in inaccurate audiovisual semantic alignment and a high likelihood of false detections.

[0005] Furthermore, regarding the accuracy of highlight reel localization, existing technologies, such as the audiovisual event localization system disclosed in Chinese patent CN119152337A based on cross-modal consistency and temporal multi-granularity collaboration, can handle events of varying durations. However, their localization results are typically products of the overall model output, lacking a refined mechanism for event boundary processing. This makes it difficult to balance high recall in event detection with high accuracy in boundary localization. Particularly for exciting actions in basketball games with ambiguous start and end times and significant duration variations (such as landing after a dunk or celebration), boundary localization often suffers from substantial errors, directly impacting the final quality of the generated highlight reel.

[0006] Therefore, how to achieve deep alignment of audiovisual semantics and improve the accuracy of boundary positioning of highlight segments is a technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0007] In view of this, the purpose of this invention is to overcome the shortcomings of the prior art and provide a method and system for automatically generating basketball video highlights based on multimodal feature fusion, which can achieve deep audiovisual semantic alignment and improve the accuracy of highlight segment boundary positioning.

[0008] To achieve the above objectives, the present invention provides the following technical solution: a method for automatically generating basketball video highlight clips based on multimodal feature fusion, including the following steps:
S1 Feature extraction: extracting visual and audio feature sequences from the basketball video and aligning them temporally;
S2 Feature fusion: using a bidirectional long short-term memory network to temporally encode the audio and visual features, and constructing a cross-modal attention mechanism with the audio modality as the query and the visual modality as the key and value to obtain a multimodal fused feature sequence;
S3 Probability prediction: inputting the fused feature sequence into a one-dimensional convolutional neural network to output a frame-by-frame probability sequence of highlight events;
S4 Coarse-grained localization: performing peak search with an adaptive threshold on the probability sequence to obtain coarse-grained timestamps of candidate highlight clips;
S5 Precise localization: extracting local fused feature segments centered on each coarse-grained timestamp and inputting them into a boundary regression network to predict the start and end offsets, thereby obtaining precise start and end time boundaries;
S6 Highlight generation: cutting clips from the original video according to the precise start and end time boundaries and splicing them to generate a highlight video.

[0009] Preferably, in S1, visual feature extraction uses a ResNet-50 network pre-trained on ImageNet and fine-tuned on a basketball dataset, extracting a 2048-dimensional feature vector from the output of the global average pooling layer; audio feature extraction resamples the audio to a 22,050 Hz mono signal, segments it into frames according to the sliding window step size, and extracts 13-dimensional MFCCs together with their first-order and second-order differences to form a 39-dimensional audio feature vector.

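[0010] Preferably, the cross-modal attention mechanism in S2 uses linear projection matrices $W_Q$, $W_K$ and $W_V$ to map the audio hidden state $h_i^a$ to a query vector $q_i = h_i^a W_Q$ and to map the visual hidden state sequence $H^v$ to a key matrix $K = H^v W_K$ and a value matrix $V = H^v W_V$. The scaled dot-product attention score $e_i = q_i K^\top / \sqrt{d_k}$ is calculated, the attention weights $\alpha_i = \mathrm{softmax}(e_i)$ are obtained by softmax normalization, and finally the weighted summation yields the fusion feature $f_i = \sum_j \alpha_{ij} v_j$.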

[0011] Preferably, S3 uses frame-by-frame binary labels $y_i \in \{0, 1\}$ as the supervision signal and the binary cross-entropy loss function $L_{BCE}$ for end-to-end training; the adaptive threshold in S4 is $\tau = \mu + k\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the probability sequence and the adjustment factor $k$ ranges from 0.5 to 2; in S5, local feature segments of length $L$ seconds are cut with the coarse-grained timestamp $t_c$ at the center, and the boundary regression network is trained using the L1 loss function.

[0012] Corresponding to the above method, the present invention also provides a system for automatically generating basketball video highlight clips based on multimodal feature fusion. The system includes:
a feature extraction module, used to extract visual feature sequences and audio feature sequences from basketball videos and align them in time;
a feature fusion module, used to temporally encode the visual feature sequence and the audio feature sequence using a bidirectional long short-term memory network to obtain a visual hidden state sequence and an audio hidden state sequence, and to construct a cross-modal attention mechanism with the audio modality as query and the visual modality as key and value to obtain a multimodal fused feature sequence;
a probability prediction module, used to input the multimodal fused feature sequence into a one-dimensional convolutional neural network and output a frame-by-frame probability sequence of highlight events;
a coarse-grained positioning module, used to perform peak search with an adaptive threshold on the probability sequence of highlight events and obtain the coarse-grained timestamps of candidate highlight segments;
a precise positioning module, used to extract local fusion feature segments centered on each of the coarse-grained timestamps, input them into the boundary regression network, predict the start and end offsets of the highlight segments relative to the center timestamp, and thus obtain the precise start and end time boundaries on the original video timeline;
a highlight generation module, used to cut segments from the original basketball video according to the precise start and end time boundaries and splice them together to generate a highlight video.

[0013] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the above-described method.

[0014] Compared with the prior art, the beneficial effects of the present invention are as follows: First, in terms of modal fusion, this invention adopts a cross-modal attention mechanism of "audio-guided vision". It uses the sparse burstiness of audio signals as a query to retrieve and focus on the visual action that triggered the sound in the historical visual sequence. It makes full use of the causal relationship between audiovisual information in time sequence, realizes deep alignment and accurate fusion of audiovisual semantics, and effectively solves the problems of inaccurate modal fusion and false detection in the prior art.

[0015] Secondly, in terms of positioning accuracy, this invention adopts a "coarse-fine dual-stage positioning" architecture. First, it quickly locates the center of candidate segments by using adaptive threshold and peak search. Then, it refines the local features through a boundary regression network to accurately predict the start and end time offsets. This effectively overcomes the shortcomings of a single model that cannot balance recall and boundary accuracy, and significantly improves the accuracy of positioning the boundaries of exciting segments. It is particularly suitable for exciting actions in basketball games with ambiguous start and end times and large duration variations.

[0016] Third, in terms of automation, this invention achieves end-to-end fully automated processing from raw video input to highlight output, without the need for manual intervention, which greatly improves the efficiency of video content production and has broad application prospects.

[0017] Other advantages, objectives, and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.

Attached Figure Description

[0018] To make the objectives, technical solutions, and advantages of the present invention clearer, the preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, wherein: Figure 1 is a schematic diagram of the system architecture of the present invention; Figure 2 is a schematic diagram of the audio-guided visual cross-modal attention mechanism of the present invention; Figure 3 is a schematic diagram of the two-stage positioning architecture process of the present invention.

Detailed Implementation

[0019] The following specific examples illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of the present invention. Unless otherwise specified, the following embodiments and features can be combined with each other.

[0020] The accompanying drawings are for illustrative purposes only and are schematic diagrams, not actual pictures. They should not be construed as limiting the invention. To better illustrate the embodiments of the invention, some parts in the drawings may be omitted, enlarged, or reduced, and do not represent the actual product dimensions. It is understandable to those skilled in the art that some well-known structures and their descriptions may be omitted in the drawings.

[0021] This invention provides a method for automatically generating basketball video highlights based on multimodal feature fusion, the system architecture of which is shown in Figure 1. The method includes the following steps. S1: Feature Extraction. The feature extraction module extracts visual and audio feature sequences highly correlated with highlights from the original basketball game video and precisely aligns them temporally. The visual and audio modalities together characterize the exciting moments of the game; visual information provides the core semantics of the event, while audio information provides immediate emotional feedback and temporal context.

[0022] First, visual feature extraction is performed. To obtain high-level visual semantic information, this embodiment uses a ResNet-50 network pre-trained on ImageNet as the base model and fine-tunes it using a constructed basketball dataset. The dataset is labeled as follows: binary labels are generated for each video frame based on the start and end times of the highlight events. 0 indicates that the frame falls within a normal event range, while 1 indicates that the frame falls within any highlight event range. After fine-tuning, the final fully connected layer of the ResNet-50 is removed, and the 2048-dimensional feature vector output from the global average pooling layer is extracted. The feature vectors of all frames are then arranged in chronological order to form a visual feature sequence $X^v = \{x_1^v, \dots, x_T^v\}$ with $x_i^v \in \mathbb{R}^{2048}$, where $T$ is the total number of frames in the video.
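The frame labeling rule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `frame_labels`, the treatment of the end time as inclusive, and the mapping of seconds to frame indices by truncation are all assumptions.

```python
def frame_labels(num_frames, fps, events):
    """Return one 0/1 label per frame: 1 if the frame lies inside any highlight event.

    events is a list of (start_seconds, end_seconds) annotations.
    Assumption: a frame index is obtained by truncating time * fps,
    and the end frame is counted as part of the event.
    """
    labels = [0] * num_frames
    for start_s, end_s in events:
        first = max(0, int(start_s * fps))
        last = min(num_frames - 1, int(end_s * fps))
        for i in range(first, last + 1):
            labels[i] = 1
    return labels


# Toy example: a 5-second clip at 2 frames per second with one event from 1.0 s to 2.0 s.
labels = frame_labels(num_frames=10, fps=2.0, events=[(1.0, 2.0)])
```

The same label sequence is reused later as the supervision signal for the probability prediction step.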

[0023] In some implementations, the base network used for visual feature extraction is not limited to ResNet-50; other deep convolutional neural networks pre-trained on large image datasets, such as ResNet-101, ResNeXt, EfficientNet, or Vision Transformer, can also be used. Fine-tuning strategies can be flexibly adjusted according to actual needs. For example, staged fine-tuning can be used: first, freeze the parameters of the first few layers of the network, train only the last few layers, and then unfreeze all layers for overall fine-tuning after the model stabilizes, in order to prevent overfitting and accelerate convergence.

[0024] Next, audio feature extraction is performed. The accompanying audio of the video is processed using the Librosa audio processing library and uniformly resampled to a 22,050 Hz mono signal, which reduces redundancy while preserving the useful audio information. To achieve precise alignment of audio features with video frames, let the video frame rate be $f_v$ frames per second, so that the duration of a single frame is $1/f_v$ seconds. The step size for audio sliding-window analysis is set to $1/f_v$ seconds, so that each step corresponds to one video frame. To facilitate efficient spectral analysis using the Fast Fourier Transform, the analysis window length is preferably set close to the sliding window step size, with a number of sampling points that is an integer power of 2, so that adjacent audio frames partially overlap, ensuring temporal continuity.
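As a small worked example of the framing arithmetic above, the sketch below derives a step size and a power-of-two window length from the sampling rate and the video frame rate. The frame rate of 25 frames per second is a hypothetical value chosen for illustration (the text does not state one), and choosing the smallest power of two strictly greater than the step is one possible reading of "close to the step size".

```python
def audio_framing(sample_rate, fps):
    """Return (hop, window) in samples for one audio frame per video frame.

    hop    : samples per video frame (sliding-window step size).
    window : smallest power of two strictly greater than hop, so that
             adjacent windows overlap (an assumed selection rule).
    """
    hop = round(sample_rate / fps)
    window = 1
    while window <= hop:
        window *= 2
    return hop, window


# 22,050 Hz audio with a hypothetical 25 fps video.
hop, win = audio_framing(22050, 25)
overlap = win - hop  # samples shared by two adjacent analysis windows
```

With these values the step is 882 samples and the window is 1024 samples, so consecutive windows share 142 samples.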

[0025] After completing the frame segmentation, the 13-dimensional basic MFCC coefficients are extracted for each audio frame to distinguish ordinary background noise from highlight-event sounds with significant spectral structure, such as those accompanying dunks, three-pointers, and audience outbursts.

[0026] On top of the 13-dimensional Mel-frequency cepstral coefficients (MFCCs), their first-order and second-order differences are computed. The first-order difference (13-dimensional) captures the critical moment when the sound transitions from a stationary state to a sudden change, thus helping the system accurately locate the instant when "an event begins," such as a sudden cheer from the audience, the sound of the referee's whistle, or the brief impact sound of a basketball hitting the rim. The second-order difference (13-dimensional) characterizes the acceleration of sound changes, making it possible not only to sense whether an event has occurred but also to follow the dynamic trajectory of sound energy as it rises, peaks, and decays during the event.

[0027] The 13-dimensional basic MFCC coefficients, along with their first and second differences, are arranged synchronously according to the video frame timestamps to form an audio feature sequence synchronized with the visual feature sequence, $X^a = \{x_1^a, \dots, x_T^a\}$ with $x_i^a \in \mathbb{R}^{39}$.
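The assembly of the 39-dimensional audio vector can be illustrated with the sketch below. It assumes the 13 MFCCs per frame have already been computed elsewhere, and it uses a plain backward difference as a stand-in for the difference estimator; audio libraries commonly apply a smoothed estimate instead, so the numbers here are illustrative only. The function names are hypothetical.

```python
def deltas(seq):
    """Backward difference along time; the first frame gets a zero vector."""
    out = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def stack_mfcc_features(mfcc):
    """Concatenate MFCCs with first- and second-order differences per frame."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return [m + a + b for m, a, b in zip(mfcc, d1, d2)]


# Toy input: 4 frames whose 13 coefficients all equal t squared (0, 1, 4, 9).
mfcc = [[float(t * t)] * 13 for t in range(4)]
features = stack_mfcc_features(mfcc)  # 4 frames x 39 dimensions
```

In the toy input the first difference grows linearly and the second difference settles to a constant, which mirrors the "velocity" and "acceleration" interpretation given above.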

[0028] In some implementations, the dimensionality of the audio features can be adjusted according to the actual effect. For example, only 13-dimensional MFCCs can be extracted without using differences, or higher-dimensional MFCC coefficients (such as 20-dimensional) and their differences can be extracted. The sliding window step size can also be set to less than the duration of a single video frame, so that one video frame corresponds to multiple audio frames; these audio features are then aggregated using average pooling or max pooling to achieve denser alignment. Conversely, increasing the step size yields sparser features and reduces computation.

[0029] S2: Feature Fusion. The feature fusion module uses the sparse bursts of audio signals to guide visual information, achieving deep semantic alignment and fusion between the two. The detailed mechanism is shown in Figure 2.

[0030] Video frame sequences have high information density and typically change slowly between consecutive frames, whereas the audio signals from the game (such as cheers from the audience and whistles from the referee) are sparse and bursty, with significant energy peaks at crucial moments. In basketball games, key visual actions (such as dunks and three-pointers) often precede the explosive cheers from the audience. This slight temporal delay indicates a strong semantic correlation between the audio signal marking a highlight and the visual content preceding it. However, under conditions such as rapid movement, multiple players obscuring the view, camera shake, or insufficient lighting, the visual features extracted by ResNet-50 are prone to degradation, and relying solely on visual information can easily lead to missed highlights. To address this issue, this embodiment proposes a cross-modal attention mechanism that uses audio as the query and visual information as the key and value. When an explosive audio signal occurs, this mechanism can automatically retrieve and focus on the preceding visual action that triggered the sound from the historical visual sequence, thereby achieving accurate cross-modal temporal alignment and significantly improving the accuracy of highlight detection.

[0031] First, bidirectional long short-term memory (Bi-LSTM) networks are used to temporally encode the visual feature sequence $X^v$ and the audio feature sequence $X^a$. For any time step $i$, this yields a visual hidden state $h_i^v$ and an audio hidden state $h_i^a$ that incorporate context information, transforming the originally isolated single-frame features into a coherent temporal representation.

[0032] Taking vision as an example, the forward LSTM first encodes the feature vectors frame by frame in temporal order, from $x_1^v$ to $x_T^v$, to obtain dependency information on preceding frames; simultaneously, the backward LSTM encodes them in reverse temporal order, from $x_T^v$ to $x_1^v$, to extract dependency information on subsequent frames. Finally, the hidden states obtained from the forward and backward passes at the same time step are concatenated: $h_i^v = [\overrightarrow{h}_i^v ; \overleftarrow{h}_i^v]$.

[0033] Next, we construct a cross-modal attention mechanism for “audio-guided vision”.

[0034] At time step $i$, the audio bidirectional hidden state $h_i^a$ serves as the basic representation of the query, and the visual bidirectional hidden state sequence $H^v$ serves as the underlying representation of the keys and values. Learnable linear projection matrices $W_Q$, $W_K$ and $W_V$ map them into a unified feature space, giving the query vector $q_i$, the key matrix $K$ and the value matrix $V$: $q_i = h_i^a W_Q$, $K = H^v W_K$, $V = H^v W_V$, where $H^v = [h_1^v; \dots; h_T^v]$ is the bidirectional hidden state matrix of the visual sequence at all time steps.

[0035] Subsequently, scaled dot-product attention is used to calculate the similarity between the query vector $q_i$ and the matrix $K$ composed of all visual key vectors: $e_i = q_i K^\top / \sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, used to scale the dot product result to avoid excessively large values that could lead to gradient instability. Attention weights at each visual time step are then obtained by normalization using the softmax function: $\alpha_i = \mathrm{softmax}(e_i)$, where $\alpha_i$ represents the distribution of the model's attention over the different historical moments of the visual sequence $H^v$, guided by the current audio context $h_i^a$.

[0036] Then, the visual value vectors $v_j$ are summed with the attention weights $\alpha_{ij}$ to obtain the fused cross-modal feature $f_i = \sum_j \alpha_{ij} v_j$. The higher the weight $\alpha_{ij}$, the stronger the semantic association between the visual information at time step $j$ and the current audio signal.

[0037] Finally, the fusion features $f_i$ of all time steps are arranged in order to obtain the complete multimodal fusion feature sequence $F = \{f_1, \dots, f_T\}$. This sequence preserves both local visual and audio features and incorporates cross-modal contextual information. When an explosive audio signal occurs, it automatically retrieves and focuses on the visual action that triggered the sound from the historical visual sequence, achieving precise cross-modal temporal alignment.
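The audio-as-query, vision-as-key-and-value attention described in the preceding paragraphs can be sketched in plain Python as follows. This is an illustrative single-head version under stated assumptions: hidden states are rows of nested lists, the projection matrices are passed in rather than learned, and every audio step attends over all visual steps (the text speaks of the historical visual sequence, so a real implementation may restrict the range). All names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists (rows of a by columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(h_audio, h_visual, w_q, w_k, w_v):
    """Return (fused features, attention weights) with audio as the query."""
    queries = matmul(h_audio, w_q)   # one query per audio time step
    keys = matmul(h_visual, w_k)     # one key per visual time step
    values = matmul(h_visual, w_v)   # one value per visual time step
    d_k = len(keys[0])
    fused, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        alpha = softmax(scores)
        weights.append(alpha)
        fused.append([sum(a * v[c] for a, v in zip(alpha, values))
                      for c in range(len(values[0]))])
    return fused, weights


# Toy example: 2 time steps, 2-dimensional hidden states, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_audio = [[1.0, 0.0], [0.0, 1.0]]
h_visual = [[4.0, 0.0], [0.0, 4.0]]
fused, weights = audio_guided_attention(h_audio, h_visual, identity, identity, identity)
```

In the toy example each audio step puts most of its weight on the visual step pointing in the same direction, which is the retrieval behavior the mechanism relies on.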

[0038] To further enhance the temporal modeling capabilities of the features, a Transformer encoder layer can be added after or before the Bi-LSTM. The Transformer's self-attention mechanism can capture longer-range temporal dependencies, complementing the Bi-LSTM. Furthermore, the attention mechanism for "audio-guided vision" can be extended to multi-head attention. By using multiple parallel attention heads, each head can focus on audiovisual relationships in different semantic subspaces. For example, one head can focus on the association between player actions and cheers, while another head can focus on the association between referee whistles and fouls, thus obtaining richer multimodal fusion features.

[0039] Thus far, this method has completed the extraction and deep fusion of multimodal features through S1 to S2, obtaining a fused feature sequence containing rich audiovisual semantic information. Next, a two-stage localization architecture is used to detect and refine the boundaries of highlight segments; the overall process is shown in Figure 3. First, the center timestamps of candidate segments are quickly screened out through coarse-grained localization, and then boundary regression is performed on these candidate segments through precise localization to finally obtain the precise start and end times.

[0040] S3: Probability Prediction. The probability prediction module converts the fused features into a frame-by-frame probability of highlight events, forming a probability curve.

[0041] First, the multimodal fusion feature sequence $F$ is input into a one-dimensional convolutional neural network (1D CNN). The 1D CNN captures the local dynamic structure of an event's occurrence through the local receptive field of its convolutional kernels in the temporal dimension, and outputs the highlight probability at each time point. The 1D CNN performs "decision normalization" on each high-dimensional feature $f_i$, compressing it into a scalar probability $p_i \in [0, 1]$.
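To make the idea of a temporal receptive field concrete, the sketch below applies a single 1D convolution over time followed by a sigmoid. It is a deliberately reduced stand-in for the 1D CNN described above: one layer, one output channel, zero padding at the sequence ends, and hand-set weights are all assumptions made for illustration.

```python
import math


def conv1d_probabilities(features, kernel, bias=0.0):
    """Slide a temporal kernel over features and squash each response to (0, 1).

    features : T x d nested list (fused feature sequence).
    kernel   : K x d nested list with K odd (weights per temporal offset).
    Frames outside the sequence are treated as zeros ("same" padding).
    """
    half = len(kernel) // 2
    zero = [0.0] * len(features[0])
    probs = []
    for t in range(len(features)):
        z = bias
        for k, w in enumerate(kernel):
            idx = t + k - half
            frame = features[idx] if 0 <= idx < len(features) else zero
            z += sum(wi * xi for wi, xi in zip(w, frame))
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


# Toy example: a single strong response at t = 2 in a 1-dimensional feature sequence.
toy_features = [[0.0], [0.0], [6.0], [0.0], [0.0]]
toy_kernel = [[0.5], [1.0], [0.5]]
probs = conv1d_probabilities(toy_features, toy_kernel, bias=-3.0)
```

Because the kernel spans three frames, the frames next to the burst also receive an elevated probability, which is what produces a smooth curve rather than isolated spikes.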

[0042] Secondly, the network uses the binary label sequence $Y = \{y_1, \dots, y_T\}$ constructed in S1 as the supervision signal and is trained with the binary cross-entropy loss function $L_{BCE}$. This loss function not only updates the parameters of the 1D CNN, but also optimizes the upstream Bi-LSTM encoders and the attention projection matrices $W_Q$, $W_K$ and $W_V$ through end-to-end backpropagation.
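For reference, the standard binary cross-entropy between predicted probabilities and 0/1 labels can be computed as in the sketch below. Averaging over frames and clipping probabilities away from 0 and 1 are assumptions; the text does not state the reduction used.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over frames (mean reduction is an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# A confident, correct prediction gives a lower loss than an uncertain one.
loss_good = binary_cross_entropy([0.9, 0.1], [1, 0])
loss_unsure = binary_cross_entropy([0.5, 0.5], [1, 0])
```

The uncertain prediction yields a loss of ln 2 per frame, the value obtained when the model outputs 0.5 everywhere.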

[0043] After training, for an input sequence the network outputs the predicted probabilities step by step as a sequence $P = \{p_1, \dots, p_T\}$. This forms a continuous probability curve of the level of excitement, reflecting how the level of excitement changes over the course of the game.

[0044] After the loss function is determined, this module uses backpropagation combined with gradient descent-like optimization methods to train the network end-to-end. Specifically, the binary cross-entropy (BCE) loss first yields the gradients of the convolutional kernel weights and subsequent fully connected layer parameters of the 1D CNN, and the gradients are then propagated upstream along the computation graph. The parameters of each layer are iteratively updated in the negative gradient direction, causing the training loss to gradually converge to a smaller value over multiple iterations. Once training is complete and the network parameters have converged, the 1D CNN can perform temporal inference on fused feature sequences of arbitrary length. For the input sequence $F$, the network outputs the corresponding predicted probability $p_i$ at each time point, forming the frame-by-frame probability sequence $P = \{p_1, \dots, p_T\}$.

[0045] In some implementations, 1D CNNs can be replaced with other temporal modeling networks, such as temporal convolutional networks (TCNs) or stacked gated recurrent units (GRUs). In addition to binary cross-entropy, Focal Loss can be introduced as a loss function to address the imbalance between positive and negative samples (highlights vs. ordinary samples), allowing the model to focus more on samples that are difficult to classify.

[0046] S4: Coarse-grained Positioning. The coarse-grained localization module quickly and adaptively locates the center timestamps of candidate highlight segments from the probability curve, and its output is the coarse-grained timestamps of the highlight segments. To cope with fluctuations in the probability curve caused by differences in the pace of the game, environmental noise, and event density, and to avoid the missed detections and false detections caused by fixed thresholds, an adaptive threshold based on global statistics is adopted.

[0047] The first moment of the probability sequence $P$ (i.e., the global mean $\mu$) and the square root of its second central moment (i.e., the standard deviation $\sigma$) are calculated. Drawing on the concept of statistical distribution centrality, an adaptive decision boundary $\tau = \mu + k\sigma$ is then constructed, where $k$ is an adjustment factor used to control the balance between the false negative rate and the false positive rate.

[0048] After determining the threshold, peak detection is performed on the probability curve. For each time point, if its probability value is higher than the threshold and is the maximum value in its local neighborhood, then the time point is marked as a coarse-grained timestamp of the candidate highlight segment.

[0049] If k is too small (e.g., k=0.1), the threshold is too low, the judgment boundary is close to the mean, and the system is extremely sensitive to probability fluctuations. It is easy to misjudge small probability fluctuations caused by background noise or non-critical actions as exciting events, leading to an increase in the false detection rate. If k is too large (e.g., k>3), the threshold is too high, the judgment boundary is far from the mean, and the system only responds to extremely significant probability peaks, leading to an increase in the false negative rate. The preferred value range of k is 0.5-2, which can avoid false detections while ensuring sensitivity, so that the method of the present invention can work stably in various types of competition scenarios.
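The adaptive threshold and peak search of S4 can be sketched as follows. The neighborhood radius (`radius`, in frames) is a hypothetical parameter, since the text only speaks of a "local neighborhood", and the population standard deviation is used as one reasonable reading of the global statistic.

```python
import statistics


def coarse_timestamps(probs, fps, k=1.0, radius=2):
    """Return (threshold, candidate timestamps in seconds).

    A frame is a candidate if its probability exceeds mean + k * std and it is
    the maximum within +/- radius frames. Ties on a flat plateau are all kept.
    """
    mu = statistics.fmean(probs)
    sigma = statistics.pstdev(probs)
    tau = mu + k * sigma
    peaks = []
    for i, p in enumerate(probs):
        lo, hi = max(0, i - radius), min(len(probs), i + radius + 1)
        if p > tau and p == max(probs[lo:hi]):
            peaks.append(i / fps)
    return tau, peaks


# Toy probability curve at 1 frame per second with two clear peaks.
curve = [0.1, 0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.1, 0.8, 0.1]
tau, peaks = coarse_timestamps(curve, fps=1.0, k=1.0)
_, strict_peaks = coarse_timestamps(curve, fps=1.0, k=3.0)
```

On this toy curve, k = 1 keeps both peaks, while k = 3 pushes the threshold above every value and returns no candidates, illustrating the trade-off between false detections and missed detections discussed above.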

[0050] There are several variations in how adaptive thresholds are calculated. Besides using the global mean plus standard deviation, dynamic thresholds based on probability sequence quantiles can also be used, such as setting the threshold to a high quantile of the sequence (e.g., the 95th quantile). The adjustment factor k does not need to be set to a fixed value; instead, it can be treated as a learnable network parameter that is automatically optimized during training based on validation set performance.
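The quantile-based variant mentioned above could look like the sketch below. The nearest-rank definition of the quantile is an assumption; other interpolation rules would give slightly different thresholds.

```python
import math


def quantile_threshold(probs, q=0.95):
    """Return the q-quantile of probs using the nearest-rank rule (an assumption)."""
    ordered = sorted(probs)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# With four values, the median under nearest-rank is the second smallest.
median_tau = quantile_threshold([0.4, 0.1, 0.3, 0.2], q=0.5)
top_tau = quantile_threshold([0.4, 0.1, 0.3, 0.2], q=1.0)
```

Unlike the mean-plus-deviation rule, this threshold always flags a fixed fraction of frames as candidates, regardless of how peaked the curve is.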

[0051] S5: Precise Positioning. The precise positioning module refines each coarse-grained timestamp by regressing the precise start and end time boundaries, ultimately outputting the precise start and end times. Even after the coarse-grained timestamps of candidate highlight segments have been obtained, the precise start and end positions of the highlight segments cannot be read off directly. This is because the coarse-grained peak only represents the moment of maximum event probability, while the actual highlight action typically begins several frames before that peak and ends some time after it.

[0052] To overcome the limitation that peak points cannot represent true boundaries, a bidirectional long short-term memory (Bi-LSTM) network is introduced for temporal modeling based on local windows. Bi-LSTM can extract cross-frame dependencies in two directions: from past to future and from future to past. This allows the network to simultaneously grasp the pre-event momentum signal and the post-event decay characteristics, providing the necessary bidirectional temporal semantics for accurate boundary regression. By reorganizing video data from the same source into a training structure with local windows and time offsets, and using Bi-LSTM for bidirectional temporal modeling, the accuracy of temporal boundary prediction can be significantly improved, allowing the model to refine the prediction from the approximate event location to the true start and end boundaries.

[0053] First, for each coarse-grained timestamp $t_c$, a local fusion feature segment of fixed length $L$ is cut out symmetrically around it, i.e., the interval $[t_c - L/2,\ t_c + L/2]$. The value of $L$ should be slightly longer than the complete duration of a typical highlight in a basketball game (such as a dunk, a three-point shot, or a block followed by a fast break), usually 3-8 seconds. In this embodiment, $L$ is preferably set to 8 seconds to ensure that the window fully covers the core action and its preceding and following context.
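The window extraction can be illustrated with a small sketch over a frame-indexed feature array. The function name `extract_window`, the `(T, D)` array layout, and the choice to clip the window at the video boundaries (rather than pad it) are assumptions made for illustration only.

```python
import numpy as np

def extract_window(features, center_frame, fps, window_sec=8.0):
    """Cut a symmetric window of about `window_sec` seconds around `center_frame`.

    features: array of shape (T, D), one fused feature vector per frame.
    Returns the slice and the index of its first frame. Windows that would
    run past the start or end of the video are clipped (an assumption).
    """
    half = int(round(window_sec * fps / 2))
    start = max(0, center_frame - half)
    end = min(len(features), center_frame + half + 1)
    return features[start:end], start

features = np.arange(100 * 3, dtype=float).reshape(100, 3)  # 100 frames, 3-dim features
window, first = extract_window(features, center_frame=50, fps=5, window_sec=8.0)
```

With 5 frames per second and an 8-second window, the slice covers 20 frames on each side of the center frame.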

[0054] Next, this local feature segment is fed into another bidirectional LSTM (Bi-LSTM) network for boundary regression. The goal of this network is to learn to predict the start offset $\Delta\hat{s}$ and the end offset $\Delta\hat{e}$ of the highlight segment relative to the center timestamp $t_c$. During training, the start time $t_s$ and the end time $t_e$ of the real event are used to construct the supervisory signals: $\Delta s = t_s - t_c$, $\Delta e = t_e - t_c$. For samples in which no real highlight event exists in the window, the supervisory signals $\Delta s$ and $\Delta e$ are set to 0 and still contribute to the training loss, so that the network learns to output offsets close to 0 for such windows, thereby automatically suppressing invalid outputs caused by falsely detected windows during the inference phase.

[0055] The network is optimized using the L1 loss function. The L1 loss can also be replaced with the Smooth L1 loss to obtain more stable gradients.

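[0056] $$\mathcal{L}_{1} = \frac{1}{N}\sum_{i=1}^{N}\left(\left|\Delta s_i - \Delta\hat{s}_i\right| + \left|\Delta e_i - \Delta\hat{e}_i\right|\right)$$

[0057] where $N$ is the number of training samples; $\Delta s_i$ and $\Delta e_i$ are the offsets of the actual start and end times of the $i$-th highlight segment relative to the center timestamp; and $\Delta\hat{s}_i$ and $\Delta\hat{e}_i$ are the start and end time offsets predicted by the Bi-LSTM, respectively.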
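The offset targets and the regression loss can be sketched numerically. This is an illustration under stated assumptions: the loss is taken as the mean over samples of the summed absolute start and end offset errors, the Smooth L1 variant uses the common formulation with a transition point `beta`, and all function names are hypothetical.

```python
import numpy as np

def make_offset_targets(center, true_start=None, true_end=None):
    """Supervisory offsets relative to the center timestamp.

    Windows with no real highlight event get (0.0, 0.0), as described above.
    """
    if true_start is None or true_end is None:
        return 0.0, 0.0
    return true_start - center, true_end - center

def l1_boundary_loss(pred, target):
    """Mean over samples of |start error| + |end error|; arrays have shape (N, 2)."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.abs(pred - target).sum(axis=1).mean())

def smooth_l1_boundary_loss(pred, target, beta=1.0):
    """Smooth L1 variant: quadratic below `beta`, linear above (beta is assumed)."""
    diff = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_elem.sum(axis=1).mean())

targets = [make_offset_targets(100.0, 98.0, 103.0),  # real event in the window
           make_offset_targets(250.0)]               # no event: zero offsets
preds = [[-1.5, 2.0], [0.5, -0.25]]
loss_l1 = l1_boundary_loss(preds, targets)
loss_smooth = smooth_l1_boundary_loss(preds, targets)
```

The quadratic region of the Smooth L1 variant shrinks the gradient for small errors, which is the stability benefit the text attributes to it.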

[0058] Finally, during the inference phase, the predicted offsets $\Delta\hat{s}$ and $\Delta\hat{e}$ output by the Bi-LSTM network are added to the center timestamp $t_c$ to restore the absolute start and end positions of the event on the original video timeline: $\hat{t}_s = t_c + \Delta\hat{s}$, $\hat{t}_e = t_c + \Delta\hat{e}$.

[0059] These absolute time points are the boundaries of the video segment and can be used directly for editing. This completes the conversion from "offset prediction" to "operable time interval" and provides an accurate temporal basis for the subsequent automatic highlight generation module.
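The conversion from predicted offsets to an editable interval can be sketched as below. Clamping to the video duration and discarding near-empty intervals via `min_duration` are assumptions added for illustration; the text itself only states that the offsets are added to the center timestamp.

```python
def to_absolute_interval(center, start_offset, end_offset,
                         video_duration=None, min_duration=0.0):
    """Add predicted offsets to the center timestamp (all values in seconds).

    Returns (start, end), or None when the interval is not longer than
    `min_duration`. Both `video_duration` and `min_duration` are hypothetical
    parameters, not taken from the source.
    """
    start = max(0.0, center + start_offset)
    end = center + end_offset
    if video_duration is not None:
        end = min(video_duration, end)
    if end - start <= min_duration:
        return None  # e.g. a falsely detected window with offsets near 0
    return start, end

interval = to_absolute_interval(120.0, -2.5, 3.0)
```

A window whose predicted offsets are both close to 0 yields a near-empty interval, so a small `min_duration` is one possible way to drop it before editing.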

[0060] Step S6: Highlight generation. The highlight generation module cuts out the precisely located event clips and splices them together into the final highlight video.

[0061] First, sort all the precise time intervals output by the precise positioning module by their start time.

[0062] Then, the sorted intervals are traversed, and adjacent intervals that overlap in time or have an interval less than a preset threshold are merged to form a longer continuous editing interval, so as to avoid repeated detection and the generation of excessively short segments.
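The sort-and-merge pass can be sketched as a standard interval merge. The function name and the `max_gap` parameter (standing in for the "preset threshold", whose value the text does not give) are assumptions.

```python
def merge_intervals(intervals, max_gap=1.0):
    """Sort (start, end) intervals by start time, then merge neighbors that
    overlap or are separated by less than `max_gap` seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and (start <= merged[-1][1] or start - merged[-1][1] < max_gap):
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

clips = merge_intervals([(40.0, 45.0), (10.0, 15.0), (20.5, 25.0), (14.0, 20.0)],
                        max_gap=1.0)
```

Because the list is sorted first, a single left-to-right pass is enough: each interval only needs to be compared with the most recently merged one.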

[0063] Finally, based on these merged intervals, the corresponding video clips are cut from the original long video and spliced together in chronological order to generate a complete highlight video of the basketball game.

[0064] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A method for automatically generating highlight clips from basketball videos based on multimodal feature fusion, characterized in that it comprises the following steps: S1: feature extraction, extracting a visual feature sequence and an audio feature sequence from a basketball video and aligning the two in time; S2: feature fusion, using a bidirectional long short-term memory network to temporally encode the visual feature sequence and the audio feature sequence to obtain a visual hidden state sequence and an audio hidden state sequence, and constructing a cross-modal attention mechanism with the audio modality as the query and the visual modality as the key and value to obtain a multimodal fused feature sequence; S3: probability prediction, inputting the multimodal fused feature sequence into a one-dimensional convolutional neural network and outputting a frame-by-frame highlight event probability sequence; S4: coarse-grained localization, performing peak search based on an adaptive threshold of the highlight event probability sequence to obtain coarse-grained timestamps of candidate highlight segments; S5: precise positioning, extracting a local fusion feature segment centered on each coarse-grained timestamp and inputting it into a boundary regression network, which predicts the start offset and end offset of the highlight segment relative to the center timestamp, thereby obtaining the precise start and end time boundaries on the original video timeline; S6: highlight generation, cutting clips from the original basketball video according to the precise start and end time boundaries and splicing them together to generate a highlight video.

2. The method according to claim 1, characterized in that, in S1, the extraction of the visual feature sequence further includes: using a ResNet-50 network pre-trained on ImageNet as the base model, fine-tuning it on a constructed basketball dataset, removing the final fully connected layer, and extracting the 2048-dimensional feature vector output by the global average pooling layer as the visual feature of each frame.

3. The method according to claim 1, characterized in that, in S1, the extraction of the audio feature sequence further includes: uniformly resampling the audio into a 22,050 Hz mono signal; determining, from the video frame rate of $f$ frames per second, the duration of a single frame as $1/f$ seconds; setting the step size of the audio sliding-window analysis to $1/f$ seconds; selecting the analysis window length as a number of sampling points that is close to the step size and is an integer power of 2; and, for each frame of audio, extracting 13-dimensional Mel-frequency cepstral coefficients together with their 13-dimensional first-order differences and 13-dimensional second-order differences to obtain a 39-dimensional audio feature vector.

4. The method according to claim 1, characterized in that, in S2, the construction of the cross-modal attention mechanism with the audio modality as the query and the visual modality as the key and value further includes: at time step $i$, obtaining the query vector $q_i = h_i^{a} W_Q$, the key matrix $K = H^{v} W_K$, and the value matrix $V = H^{v} W_V$ through the linear projection matrices $W_Q$, $W_K$, and $W_V$, where $h_i^{a}$ is the audio hidden state at time step $i$ and $H^{v}$ is the hidden state matrix of the visual sequence at all time steps; calculating, by scaled dot-product attention, the similarity between the query vector $q_i$ and the matrix $K$ composed of all visual key vectors, $s_i = q_i K^{\top} / \sqrt{d_k}$, where $d_k$ is the dimension of the key vectors; normalizing with the softmax function to obtain the attention weights $\alpha_i = \mathrm{softmax}(s_i)$; and computing the weighted sum of the visual value vector sequence according to the attention weights to obtain the fused cross-modal feature $c_i = \alpha_i V$.

5. The method according to claim 1, characterized in that, in S3, frame-by-frame binary labels are adopted as the supervisory signal, where 0 indicates that the frame lies in an ordinary (non-highlight) interval and 1 indicates that the frame lies in any highlight event interval; and a binary cross-entropy loss function is used to train the network end to end.

6. The method according to claim 1, characterized in that, in S4, the adaptive threshold $\theta$ is constructed from the global mean $\mu$ and the standard deviation $\sigma$ of the probability sequence: $\theta = \mu + k\sigma$, where $k$ is an adjustment factor with a value range of 0.5-2.

7. The method according to claim 1, characterized in that, in S4, the peak search further includes: marking each time point whose probability value is higher than the threshold and is the maximum within its local neighborhood as a coarse-grained timestamp of a candidate highlight segment.

8. The method according to claim 1, characterized in that, in S5, the extraction of the local fusion feature segment further includes: taking the coarse-grained timestamp $t_c$ as the center, symmetrically cutting out the features of the interval $[t_c - L/2,\ t_c + L/2]$ of fixed length $L$, where $L$ is 8 seconds.

9. The method according to claim 1, characterized in that, in S5, the boundary regression network is a bidirectional long short-term memory network optimized with the L1 loss function: $$\mathcal{L}_{1} = \frac{1}{N}\sum_{i=1}^{N}\left(\left|\Delta s_i - \Delta\hat{s}_i\right| + \left|\Delta e_i - \Delta\hat{e}_i\right|\right)$$ where $N$ is the number of training samples; $\Delta s_i$ and $\Delta e_i$ are the offsets of the actual start and end times of the $i$-th highlight segment relative to the center timestamp; and $\Delta\hat{s}_i$ and $\Delta\hat{e}_i$ are the start and end time offsets predicted by the Bi-LSTM, respectively.

10. A system for automatically generating basketball video highlights based on multimodal feature fusion, characterized in that it comprises: a feature extraction module, used to extract a visual feature sequence and an audio feature sequence from a basketball video and align them in time; a feature fusion module, used to temporally encode the visual feature sequence and the audio feature sequence using a bidirectional long short-term memory network to obtain a visual hidden state sequence and an audio hidden state sequence, and to construct a cross-modal attention mechanism with the audio modality as the query and the visual modality as the key and value to obtain a multimodal fused feature sequence; a probability prediction module, used to input the multimodal fused feature sequence into a one-dimensional convolutional neural network and output a frame-by-frame highlight event probability sequence; a coarse-grained positioning module, used to perform peak search based on an adaptive threshold of the highlight event probability sequence and obtain coarse-grained timestamps of candidate highlight segments; a precise positioning module, used to extract a local fusion feature segment centered on each coarse-grained timestamp, input it into a boundary regression network, and predict the start and end offsets of the highlight segment relative to the center timestamp, thereby obtaining the precise start and end time boundaries on the original video timeline; and a highlight generation module, used to cut segments from the original basketball video according to the precise start and end time boundaries and splice them together to generate a highlight video.