An audio-sparseness perception and semantic guidance based audio-visual event localization method

By performing frame-by-frame processing and multi-dimensional statistical feature analysis on audio signals, combined with visual attention enhancement and silent frame restoration, the problems of semantic loss and cross-modal alignment in sparse sample processing are solved, and high-precision audiovisual event localization is achieved.

CN122196693A · Pending · Publication Date: 2026-06-12 · LIAONING UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
LIAONING UNIVERSITY OF TECHNOLOGY
Filing Date
2026-04-01
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies suffer from problems such as semantic loss of sparse samples, interference from silent frames, insufficient cross-modal feature alignment accuracy, and poor robustness of time-series modeling when processing audio and video signals with uneven temporal distribution.

Method used

The method segments audio signals into frames, extracts multidimensional statistical features, constructs a sparsity score, and divides samples into dense and sparse branches. It combines visual attention enhancement with silent frame restoration, fuses features using a bidirectional long short-term memory network and a cross-modal interaction module, and maps the fused features to a pre-trained joint semantic space for semantic alignment. A classifier then outputs the category and temporal boundaries of audiovisual events.

Benefits of technology

It effectively suppresses background noise, preserves key semantics in sparse scenes, improves the accuracy and robustness of cross-modal semantic alignment, enhances the accuracy and stability of audiovisual event localization, and adapts to complex heterogeneous environments.

✦ Generated by Eureka AI based on patent content.

Figure CN122196693A_ABST

Abstract

The application provides an audio-visual event localization method based on audio sparsity perception and semantic guidance, and relates to the technical field of artificial intelligence and multimodal perception. The method comprises the following steps: multidimensional audio feature extraction and signal sparsity analysis; robust cross-sample feature normalization; construction of an audio sparsity scoring function and sample classification; dual-branch differentiated feature enhancement based on sparsity perception; intelligent repair and weighted fusion of sparse audio; single-modal time-series modeling and two-layer cross-modal deep interaction; semantic consistency modeling based on the CLIP pre-trained space; multi-task joint loss constraint and model optimization; and audio-visual event prediction and classification output. The method introduces an audio sparsity analysis mechanism and applies differentiated branch processing to samples with different distribution characteristics, which effectively addresses the semantic imbalance problem in sparse scenes. It also designs an audio enhancement strategy combining periodic filling and linear interpolation, thereby preserving the temporal continuity of the signal.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and multimodal perception technology, and in particular to a method for locating audiovisual events based on audio sparsity perception and semantic guidance.

Background Technology

[0002] With the rapid development of multimedia processing technology and artificial intelligence, Audio-Visual Event Localization (AVE) has become one of the core technologies in the field of cross-modal perception, aiming to establish a high-precision correlation between audio signals and video images in the temporal dimension. In complex audio-visual scenarios, how to effectively fuse heterogeneous modal information to achieve accurate event boundary delineation and category recognition is currently a hot research topic.

[0003] However, in real-world audiovisual positioning applications, traditional cross-modal fusion methods face significant challenges.

[0004] On the one hand, existing technologies ignore the significant heterogeneity of audio signals along the time axis. Real-world audio contains both continuous, dense signals and a large amount of intermittent, transient, sparse signals. Traditional feature extraction and fusion strategies often apply a uniform processing logic to all samples, failing to identify and specifically address "silent frames" or "low-energy gaps" in the audio. This leads to irrelevant background noise features being introduced into the fusion process when processing sparse samples, diluting crucial semantic information and resulting in an imbalance in the semantic distribution between audio and video branches.

[0005] On the other hand, the depth and semantic alignment mechanisms for cross-modal interactions remain imperfect. Existing methods mostly rely on simple attention mechanisms or single-layer cross-modal projections, making it difficult to capture the complex global temporal dependencies within audio and video sequences. Without high-level semantic space supervision, models are prone to "modal shift" under complex background interference, meaning they cannot accurately determine the degree of matching between video frames and the global semantics of the audio. Furthermore, for damaged or sparse audio signals, existing technologies lack effective feature completion and enhancement methods, often leading to interruptions in temporal continuity and thus affecting the model's generalization ability in long sequence tasks.

Summary of the Invention

[0006] This invention proposes an audiovisual event localization method based on audio sparsity perception and semantic guidance to solve the problems of sparse sample semantic loss, silent frame interference, insufficient cross-modal feature alignment accuracy, and poor robustness of temporal modeling in complex heterogeneous environments when processing audio and video signals with uneven temporal distribution.

[0007] This invention provides a method for locating audiovisual events based on audio sparsity perception and semantic guidance. The method includes the following steps:

[0008] Step S1: After performing frame-by-frame processing on the audio signal of the input video, extract multidimensional statistical features describing the temporal distribution characteristics of the audio signal;

[0009] Step S2: Construct an audio sparsity score based on the multidimensional statistical features, and classify the audio samples into sparse audio samples or dense audio samples according to the audio sparsity score.

[0010] Step S3: Guided by the audio features of the dense audio samples, attention weighting is applied to the visual features of the corresponding video frames to obtain enhanced dense visual features, while retaining the original audio features of the dense audio samples.

[0011] Step S4: Perform silent frame detection and repair processing on the audio features of the sparse audio samples to obtain the repaired sparse audio features, and extract the original visual features of the video frames corresponding to the sparse audio samples.

[0012] Step S5: Input the enhanced dense visual features and the corresponding original audio features, as well as the original visual features and the repaired sparse audio features, into a bidirectional long short-term memory network for single-modal temporal modeling, and perform bidirectional feature fusion through a cross-modal interaction module to obtain multimodal fused features;

[0013] Step S6: Map the multimodal fusion features to the pre-trained joint semantic space to obtain the semantic similarity between audio features and visual features, and reweight the features based on the semantic similarity to obtain semantically aligned features;

[0014] Step S7: Input the semantically aligned features into the classifier and output the category of the audiovisual event and its temporal boundary.

[0015] Furthermore, the specific method for extracting multidimensional statistical features describing the temporal distribution characteristics of the audio signal after performing frame-by-frame processing on the audio signal of the input video in step S1 includes:

[0016] The audio signal is extracted from the input video file and converted to WAV format. The audio sampling rate is set to 16kHz, and a mono signal is used. Then, a sliding window is used to process the audio signal into frames, with the frame length set to 25ms and the frame shift set to 10ms. The Hamming window function is used to window each frame.

[0017] Let the audio signal of the t-th frame be $x_t(n)$, where $n = 1, 2, \ldots, N$ is the intra-frame sampling point index and $N$ is the number of sampling points per frame. Then the short-time energy of the t-th frame is:

[0018] $$E_t = \sum_{n=1}^{N} x_t^2(n),$$

[0019] where $E_t$ represents the short-time energy of frame $t$, and $x_t(n)$ represents the audio amplitude of the n-th sampling point in the t-th frame;

[0020] Six statistical features are extracted from the audio frame sequence to describe its temporal distribution characteristics: energy coverage, active segment interval, energy change rate, peak density, RMS coefficient of variation, and zero-crossing rate standard deviation. These features are used to describe the sparse and dense distribution characteristics of the audio signal in the time dimension.
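The framing and short-time energy computation described above can be illustrated with a minimal standard-library sketch. The function names `frame_signal` and `short_time_energy` are hypothetical, the energy is taken as the plain sum of squared windowed samples, and trailing samples that do not fill a whole frame are simply dropped; these are assumptions for illustration, not details stated in the text.

```python
import math

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a mono signal into overlapping Hamming-windowed frames."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    # Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Assumption: samples that do not fill a final whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames

def short_time_energy(frame):
    """Sum of squared amplitudes of one (windowed) frame."""
    return sum(x * x for x in frame)

# Usage: one second of a 440 Hz tone followed by one second of silence.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
signal = tone + [0.0] * 16000
energies = [short_time_energy(f) for f in frame_signal(signal)]
```

In this toy signal the early frames carry energy while the frames that fall entirely in the silent half have zero energy, which is the kind of contrast the six statistical features are meant to summarize.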

[0021] Furthermore, before constructing the audio sparsity score based on the multidimensional statistical features in step S2, cross-sample normalization is first performed on the six statistical features:

[0022] Let $F_i$ denote the original feature value of the i-th feature dimension, and let $Q_i^{\text{low}}$ and $Q_i^{\text{high}}$ denote the lower and upper quantiles of that feature over the sample set. The normalization is then calculated as:

[0023] $$F_i^{\text{norm}} = \frac{F_i - Q_i^{\text{low}}}{Q_i^{\text{high}} - Q_i^{\text{low}}};$$

[0024] This maps the features to a range of 0 to 1;

[0025] Values exceeding the range are truncated:

[0026] $$F_i^{\text{norm}} \leftarrow \min\left(\max\left(F_i^{\text{norm}}, 0\right), 1\right);$$

[0027] where $F_i^{\text{norm}}$ represents the normalized feature value.
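As a rough illustration of the cross-sample normalization, the sketch below scales one feature dimension by its lower and upper quantiles and clips the result to [0, 1]. The helper names `quantile` and `robust_normalize` are hypothetical, the linear-interpolation quantile rule is one common convention chosen here, and the default levels of 0.10 and 0.90 follow the 10th and 90th quantiles named in the detailed embodiment.

```python
def quantile(values, q):
    """Quantile of a list with linear interpolation, q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def robust_normalize(values, q_low=0.10, q_high=0.90):
    """Scale one feature dimension by its quantile range, then clip to [0, 1]."""
    low, high = quantile(values, q_low), quantile(values, q_high)
    span = high - low
    if span == 0:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [min(max((v - low) / span, 0.0), 1.0) for v in values]

# Usage: the outlier 1000 is clipped to 1.0 instead of compressing the other values.
print(robust_normalize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]))
```

Because the scale comes from the quantile range rather than the minimum and maximum, a single extreme value does not squeeze the remaining samples toward zero.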

[0028] Furthermore, the specific method for constructing an audio sparsity score based on the multidimensional statistical features in step S2, and for classifying audio samples into sparse audio samples or dense audio samples according to the audio sparsity score, includes:

[0029] Construct an audio sparsity scoring function:

[0030] Perform inverse mapping for features that are sparser with smaller values: $\tilde{F}_i = 1 - F_i^{\text{norm}}$;

[0031] Then, the multidimensional features are fused:

[0032] $$S = \frac{1}{6}\sum_{i=1}^{6} \tilde{F}_i,$$

[0033] where $S$ represents the audio sparsity score, and $\tilde{F}_i = F_i^{\text{norm}}$ for features that need no inverse mapping. Based on a score threshold $\theta$, the samples are divided into sparse and dense samples, and corresponding sparsity masks are generated.
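The scoring and thresholding step can be sketched as follows. The names `sparsity_score` and `sparsity_mask` are hypothetical. Equal fusion weights and the mask convention (0 for sparse, 1 for dense, with scores below the threshold treated as sparse) are taken from the detailed embodiment; which feature dimensions are inverted is left as a parameter because it depends on the direction of each feature.

```python
def sparsity_score(normalized, inverted_dims=()):
    """Equal-weight fusion of direction-unified, normalized features."""
    unified = [1.0 - f if i in inverted_dims else f
               for i, f in enumerate(normalized)]
    return sum(unified) / len(unified)

def sparsity_mask(scores, threshold):
    """Mark each sample: 0 for sparse (score below threshold), 1 for dense."""
    return [0 if s < threshold else 1 for s in scores]

# Usage: six normalized features, with dimension 1 inverted (hypothetical choice).
score = sparsity_score([0.2, 0.8, 0.5, 0.5, 0.5, 0.5], inverted_dims=(1,))
mask = sparsity_mask([0.1, 0.4, 0.7], threshold=0.4)
```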

[0034] Furthermore, the specific method of step S3, in which attention weighting guided by the audio features of the dense audio samples is applied to the visual features of the corresponding video frames to obtain enhanced dense visual features while retaining the original audio features of the dense audio samples, includes:

[0035] The dense audio samples contain continuous audio information and valid event signals, so an audio information-guided visual attention mechanism is used to enhance visual features.

[0036] First, the video frames are encoded using a visual feature extraction network;

[0037] The visual coding network adopts the ResNet-50 structure, with an input video frame size of 224×224 pixels, and extracts high-dimensional visual features for each frame; the corresponding audio feature representation is extracted through the audio coding network, and the audio encoder adopts the VGGish network structure; then the attention weights of the visual features are calculated using the audio features as guiding signals to obtain the enhanced dense visual features.
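The text does not give the attention formula, so the following is only one plausible form: scaled dot-product attention in which an audio vector scores the spatial region features of a single frame. It assumes the audio and visual features have already been projected to the same dimension, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def audio_guided_attention(audio_vec, region_feats):
    """Weight the region features of one frame by their match to the audio vector."""
    scale = math.sqrt(len(audio_vec))
    # One score per spatial region: scaled dot product with the audio vector.
    scores = [sum(a * r for a, r in zip(audio_vec, region)) / scale
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    # Attention-weighted sum of the region features.
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(dim)]
    return attended, weights

# Usage: the first region aligns with the audio vector and receives the larger weight.
attended, weights = audio_guided_attention([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
```

In a real system the region features would come from the ResNet-50 feature map and the audio vector from VGGish, each followed by a learned projection; those parts are omitted here.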

[0038] Further, the specific method for performing silent frame detection and repair processing on the audio features of the sparse audio samples in step S4 to obtain the repaired sparse audio features, and extracting the original visual features of the video frames corresponding to the sparse audio samples, includes:

[0039] The sparse audio samples contain silent frames in their audio sequences. The original visual features of the video frames in the sparse audio samples are extracted using a visual coding network.

[0040] For the detected silent frames, the audio sequence is first analyzed using a period search algorithm based on the autocorrelation function. The period of the audio signal is estimated by finding the peak position in the autocorrelation function, and the period information is used to fill in the missing frames.

[0041] Let the audio signal be $x(n)$. Its autocorrelation function is:

[0042] $$R(\tau) = \sum_{n} x(n)\,x(n+\tau),$$

[0043] where $R(\tau)$ denotes the autocorrelation value at delay $\tau$;

[0044] When a stable period $T$ is detected, periodic completion is performed:

[0045] $$\hat{x}(n) = x(n - T),$$

[0046] When there is no obvious periodic structure in the audio signal, a linear interpolation method is used to complete the silent frame by using adjacent valid frames: $\hat{x}(n) = x(n_1) + \frac{n - n_1}{n_2 - n_1}\left(x(n_2) - x(n_1)\right)$, where $x(n)$ denotes the audio signal at frame $n$, and $n_1$ and $n_2$ are the nearest valid frames before and after the silent frame $n$.

[0047] Let the original audio features be $A$, the repaired audio features be $\hat{A}$, and the fusion weight be $\alpha$. The enhanced audio features are: $A' = \alpha A + (1 - \alpha)\hat{A}$.

[0048] When multiple consecutive frames are silent and cannot be recovered by interpolation, the average value of the features of the non-silent frames is used to fill in the gaps, resulting in the repaired sparse audio features.

[0049] After restoration, dynamic weighting coefficients are introduced to perform weighted fusion of the original audio features and the restored audio features, resulting in restored sparse audio features:

[0050] Let the original audio features be $A$, the repaired audio features be $\hat{A}$, and the fusion weight be $\alpha$. The enhanced audio features are:

[0051] $$A' = \alpha A + (1 - \alpha)\hat{A}.$$
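The autocorrelation-based period search, periodic filling, and weighted fusion can be sketched on a short scalar sequence. All function names are hypothetical. The rule that a period counts as "stable" when its autocorrelation peak reaches half of the zero-lag value is an assumption, as is the choice to let the fusion weight `alpha` multiply the original features.

```python
def autocorrelation(x, lag):
    """Unnormalized autocorrelation: sum of x(n) * x(n + lag)."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

def estimate_period(x, min_lag=1, max_lag=None, min_ratio=0.5):
    """Return the lag of the strongest autocorrelation peak, or None."""
    max_lag = max_lag or len(x) // 2
    r0 = autocorrelation(x, 0)
    if r0 == 0:
        return None
    best_lag, best_val = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = autocorrelation(x, lag)
        if val > best_val:
            best_lag, best_val = lag, val
    # Assumption: the period is "stable" if its peak reaches min_ratio of R(0).
    return best_lag if best_val >= min_ratio * r0 else None

def periodic_fill(x, silent, period):
    """Copy into each silent index the value a whole number of periods away."""
    repaired = list(x)
    for n in sorted(silent):
        src = n - period
        while src >= 0 and src in silent:
            src -= period
        if src < 0:
            # Nothing usable earlier in the sequence, so look forward instead.
            src = n + period
            while src < len(x) and src in silent:
                src += period
        if 0 <= src < len(x):
            repaired[n] = x[src]
    return repaired

def fuse(original, repaired, alpha):
    """Weighted fusion; alpha weights the original features (assumption)."""
    return [alpha * o + (1 - alpha) * r for o, r in zip(original, repaired)]

# Usage: a period-4 pattern with one value zeroed out at index 8.
clean = [1, 0, -1, 0] * 4
damaged = list(clean)
damaged[8] = 0
period = estimate_period(damaged)
restored = periodic_fill(damaged, {8}, period)
```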

[0052] Further, the specific method for inputting the enhanced dense visual features and the corresponding original audio features, as well as the original visual features and the repaired sparse audio features, into a bidirectional long short-term memory network for single-modal temporal modeling in step S5, and then performing bidirectional feature fusion through a cross-modal interaction module to obtain multimodal fused features includes:

[0053] The bidirectional long short-term memory (Bi-LSTM) network consists of two layers, each with 512 hidden units, and is used to capture long-range contextual dependencies to complete single-modal temporal modeling.

[0054] The enhanced dense visual features and their corresponding original audio features, as well as the original visual features and their repaired sparse audio features, are respectively input into a bidirectional long short-term memory network for single-modal temporal modeling.

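[0055] $$H_a = f_{\text{BiLSTM}}(X_a), \quad H_v = f_{\text{BiLSTM}}(X_v);$$

[0056] where $f_{\text{BiLSTM}}(\cdot)$ represents the encoding function of the temporal neural network, $X_a$ and $X_v$ are the input audio and visual feature sequences, $H_a$ represents the audio temporal features, and $H_v$ represents the visual temporal features;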

[0057] Audio and visual features are input into the cross-modal interaction module for deep fusion;

[0058] A multi-layer cross-attention mechanism is used to achieve bidirectional information interaction, from audio to vision and from vision to audio. Residual connections and layer normalization operations are introduced in each layer to obtain multimodal fusion features.

[0059] Further, in step S6, the multimodal fusion features are mapped to a pre-trained joint semantic space to obtain the semantic similarity between audio and visual features. The features are then reweighted based on this semantic similarity to obtain semantically aligned features. The specific method for this is as follows:

[0060] The audio features in the multimodal fusion features are averaged and then projected onto the shared semantic space of the CLIP model through a linear mapping layer to obtain the global audio embedding.

[0061] The audio features are average-pooled over the $M$ frames of the sequence: $\bar{h}_a = \frac{1}{M}\sum_{t=1}^{M} h_a^{(t)}$;

[0062] The visual features in the multimodal fusion features are projected onto the shared semantic space of the CLIP model through an adaptation layer to obtain a visual frame-level embedding.

[0063] Calculate the cosine similarity between the global audio embedding and the visual frame-level embedding for each frame:

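[0064] $$s_t = \frac{e_a \cdot e_v^{(t)}}{\lVert e_a \rVert \, \lVert e_v^{(t)} \rVert},$$

[0065] where $\lVert \cdot \rVert$ denotes the vector norm, $e_a$ is the global audio embedding, and $e_v^{(t)}$ is the visual frame-level embedding of frame $t$;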

[0066] The cosine similarity is normalized into semantic matching weights using the Softmax function;

[0067] The audio and visual features in the multimodal fusion features are weighted and fused using the semantic matching weights to obtain semantically aligned features.
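The semantic alignment step can be illustrated with a small sketch that pools the audio features, computes per-frame cosine similarity, and turns the similarities into Softmax weights. It assumes the inputs have already been projected into the shared space, so the linear mapping and adaptation layers are omitted, and all function names are hypothetical.

```python
import math

def mean_pool(frames):
    """Average a sequence of feature vectors over time."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_weights(audio_embed, visual_embeds):
    """Softmax over per-frame cosine similarities to the global audio embedding."""
    sims = [cosine(audio_embed, v) for v in visual_embeds]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def reweight(features, weights):
    """Scale each frame's feature vector by its semantic matching weight."""
    return [[w * x for x in f] for f, w in zip(features, weights)]

# Usage: the first visual frame matches the audio direction, the second does not.
audio_embed = mean_pool([[1.0, 0.0], [1.0, 0.0]])
weights = semantic_weights(audio_embed, [[2.0, 0.0], [0.0, 3.0]])
aligned = reweight([[2.0, 0.0], [0.0, 3.0]], weights)
```

Because cosine similarity lies in [-1, 1], the Softmax weights stay fairly close together unless a temperature is added; whether the method uses one is not stated in the text.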

[0068] Furthermore, the classifier mentioned in step S7 includes:

[0069] In the fully supervised mode, a two-layer fully connected network is used to directly output the event category probability of each frame;

[0070] In the weakly supervised mode, dynamic weighted branching is used to model the saliency of different categories, and the video-level event classification results and corresponding time positioning intervals are output.

[0071] Compared with the prior art, the present invention has the following advantages:

[0072] 1. Existing technologies employ a uniform feature extraction and fusion strategy for all audio samples, ignoring the heterogeneity of audio signals along the time axis. This leads to background noise from sparse samples being introduced into the fusion process, diluting key semantic information. This invention quantifies the sparsity of audio signals by constructing multidimensional statistical features, and accordingly divides samples into sparse and dense categories, sending them to differentiated branches for processing. The dense branch utilizes audio to guide visual attention and enhance key information, while the sparse branch avoids ineffective attention interference. This achieves accurate adaptation to samples with different distribution characteristics, effectively suppressing background noise and preserving key semantics in sparse scenes.

[0073] 2. To address the semantic gaps caused by silent frames in sparse audio, existing technologies lack effective feature completion methods, making it difficult to guarantee temporal continuity. This invention proposes a three-level progressive audio restoration strategy: firstly, using autocorrelation function to search for signal periods for restoration; secondly, employing linear interpolation when no significant period is found; and finally, using the mean of non-silent frames as a fallback solution. This strategy can adaptively select the optimal restoration method based on the actual characteristics of the audio signal, ensuring the temporal integrity of the audio feature sequence and providing high-quality audio representation for subsequent cross-modal fusion.

[0074] 3. Existing cross-modal fusion methods often rely on simple attention mechanisms or single-layer projection, making it difficult to capture the complex global temporal dependencies within audio and video sequences. They are also prone to modal shifts under background interference. This invention maps audio and video features to the shared semantic space of the CLIP pre-trained model. By calculating semantic similarity to generate matching weights, features are reweighted, allowing modal information with higher semantic consistency to receive higher weights during the fusion process. Simultaneously, a two-layer cross-attention mechanism enables deep interaction and bidirectional information penetration within modalities, significantly improving the accuracy and robustness of cross-modal semantic alignment.

[0075] 4. This invention constructs a multi-task joint loss function that includes classification loss, audiovisual synchronization assistance loss, semantic consistency loss, and knowledge distillation loss. During the training phase, it simultaneously optimizes event classification accuracy, temporal synchronization, and semantic alignment capabilities. This multi-task collaborative mechanism enables the model to maintain stable localization performance in complex heterogeneous audiovisual environments, while supporting both fully supervised and weakly supervised working modes, demonstrating good practical value and deployment flexibility.

[0076] The implementations provided in the above aspects of this application may be further combined to provide additional implementations.

Attached Figure Description

[0077] The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily apparent upon reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the invention are illustrated by way of example and not limitation, with the same or corresponding reference numerals denoting the same or corresponding parts, wherein:

[0078] Figure 1 is a flowchart of an audiovisual event localization method based on audio sparsity perception and multidimensional semantic enhancement, provided in an embodiment of the present invention.

[0079] Figure 2 is a schematic structural diagram of an audiovisual event localization system based on audio sparsity perception and multidimensional semantic enhancement, provided in an embodiment of the present invention.

Detailed Implementation

[0080] The exemplary embodiments disclosed in this application will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of this application are shown in the drawings, it should be understood that this application can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of this application and to fully convey the scope of this application to those skilled in the art. Unless otherwise specified, the technical means used in the embodiments are conventional means well known to those skilled in the art.

[0081] This embodiment provides a method for audiovisual event localization based on audio sparsity perception and semantic guidance. This method can be deployed on GPU-accelerated computing devices, such as server environments configured with Intel Xeon processors, at least 32GB of RAM, and NVIDIA RTX 3090 or equivalent GPUs. The operating environment can be Ubuntu 20.04, and the deep learning framework can be PyTorch 1.12 or later. The system input is a multimedia file containing video and audio information, and the output is the category prediction result of the audiovisual event and its corresponding temporal localization interval.

[0082] With reference to Figure 1, a flowchart of an audiovisual event localization method based on audio sparsity perception and multidimensional semantic enhancement is provided. The method includes the following steps:

[0083] Step S1: After performing frame-by-frame processing on the audio signal of the input video, extract multidimensional statistical features describing the temporal distribution characteristics of the audio signal;

[0084] Step S2: Construct an audio sparsity score based on the multidimensional statistical features, and classify the audio samples into sparse audio samples or dense audio samples according to the audio sparsity score.

[0085] Step S3: Guided by the audio features of the dense audio samples, attention weighting is applied to the visual features of the corresponding video frames to obtain enhanced dense visual features, while retaining the original audio features of the dense audio samples.

[0086] Step S4: Perform silent frame detection and repair processing on the audio features of the sparse audio samples to obtain the repaired sparse audio features, and extract the original visual features of the video frames corresponding to the sparse audio samples.

[0087] Step S5: Input the enhanced dense visual features and the corresponding original audio features, as well as the original visual features and the repaired sparse audio features, into a bidirectional long short-term memory network for single-modal temporal modeling, and perform bidirectional feature fusion through a cross-modal interaction module to obtain multimodal fused features;

[0088] Step S6: Map the multimodal fusion features to the pre-trained joint semantic space to obtain the semantic similarity between audio features and visual features, and reweight the features based on the semantic similarity to obtain semantically aligned features;

[0089] Step S7: Input the semantically aligned features into the classifier and output the category of the audiovisual event and its temporal boundary.

[0090] Optionally, the specific method for extracting multidimensional statistical features describing the temporal distribution characteristics of the audio signal after performing frame segmentation processing on the audio signal of the input video in step S1 includes:

[0091] The audio signal is extracted from the input video file and converted to WAV format. The audio sampling rate is set to 16kHz, and a mono signal is used. Then, a sliding window is used to process the audio signal into frames, with the frame length set to 25ms and the frame shift set to 10ms. The Hamming window function is used to window each frame.

[0092] Let the audio signal of the t-th frame be $x_t(n)$, where $n = 1, 2, \ldots, N$ is the intra-frame sampling point index and $N$ is the number of sampling points per frame. Then the short-time energy of the t-th frame is:

[0093] $$E_t = \sum_{n=1}^{N} x_t^2(n),$$

[0094] where $E_t$ represents the short-time energy of frame $t$, and $x_t(n)$ represents the audio amplitude of the n-th sampling point in the t-th frame;

[0095] Six statistical features are extracted from the audio frame sequence to describe its temporal distribution characteristics: energy coverage, active segment interval, energy change rate, peak density, RMS coefficient of variation, and zero-crossing rate standard deviation. These features are used to describe the sparse and dense distribution characteristics of the audio signal in the time dimension.

[0096] Specifically, the system first extracts the audio signal from the input video file and converts it to WAV format, using a 16kHz sampling rate and a mono signal. A sliding window then segments the audio signal into frames, with a frame length of 25ms and a frame shift of 10ms, and a Hamming window function is applied to each frame. After segmentation, the system calculates six statistical features for each audio sample to describe its temporal distribution characteristics. First, it calculates the audio frame energy sequence and, based on an energy threshold, the proportion of high-energy frames among all frames, which gives the energy coverage. Second, based on the active segments formed by consecutive high-energy frames, it calculates the average time interval between adjacent active segments, which gives the active segment interval. Third, it calculates the average absolute value of the energy difference between adjacent frames, which gives the energy change rate and describes audio energy fluctuation. Fourth, it uses a peak detection algorithm to count the number of energy peaks per unit time, which gives the peak density. Fifth, it calculates the ratio of the standard deviation to the mean of the audio frame RMS energy sequence, which gives the RMS coefficient of variation. Finally, it calculates the zero-crossing rate of each frame and takes the standard deviation of these values across frames, which gives the zero-crossing rate standard deviation. Through this feature extraction process, the sparse or dense distribution of the audio signal on the time axis can be characterized along multiple dimensions, such as energy distribution, temporal structure, and signal oscillation stability.
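A compact sketch of the six features is given below, computed from per-frame energy, RMS, and zero-crossing rate sequences that are assumed to be available already. The function name `temporal_features`, the dictionary keys, the use of a simple local-maximum test as the peak detector, and the use of the population standard deviation are all illustrative assumptions; the text does not fix these details.

```python
import statistics

def temporal_features(energies, rms, zcrs, energy_threshold, shift_s=0.010):
    """Six temporal-distribution features for one audio sample (needs >= 2 frames)."""
    n = len(energies)
    active = [e > energy_threshold for e in energies]
    coverage = sum(active) / n

    # Gaps (in seconds) between the end of one active segment and the next.
    gaps, gap, seen_active = [], 0, False
    for is_active in active:
        if is_active:
            if seen_active and gap > 0:
                gaps.append(gap * shift_s)
            seen_active, gap = True, 0
        elif seen_active:
            gap += 1
    interval = sum(gaps) / len(gaps) if gaps else 0.0

    # Mean absolute energy difference between adjacent frames.
    change_rate = sum(abs(energies[t] - energies[t - 1]) for t in range(1, n)) / (n - 1)

    # Assumption: a peak is a local maximum that also exceeds the energy threshold.
    peaks = sum(1 for t in range(1, n - 1)
                if energies[t] > energies[t - 1]
                and energies[t] > energies[t + 1]
                and energies[t] > energy_threshold)
    peak_density = peaks / (n * shift_s)   # peaks per second

    mean_rms = statistics.fmean(rms)
    rms_cv = statistics.pstdev(rms) / mean_rms if mean_rms else 0.0
    zcr_std = statistics.pstdev(zcrs)

    return {
        "energy_coverage": coverage,
        "active_segment_interval": interval,
        "energy_change_rate": change_rate,
        "peak_density": peak_density,
        "rms_cv": rms_cv,
        "zcr_std": zcr_std,
    }

# Usage: two short bursts separated by two quiet frames.
feats = temporal_features(
    energies=[0, 4, 0, 0, 4, 0],
    rms=[0, 2, 0, 0, 2, 0],
    zcrs=[0.1] * 6,
    energy_threshold=1,
)
```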

[0097] Optionally, step S2: the specific method for constructing an audio sparsity score based on the multidimensional statistical features, and classifying audio samples into sparse audio samples or dense audio samples according to the audio sparsity score includes:

[0098] Before constructing the audio sparsity score based on the multidimensional statistical features in step S2, cross-sample normalization is first performed on the six statistical features:

[0099] Let $F_i$ denote the original feature value of the i-th feature dimension, and let $Q_i^{\text{low}}$ and $Q_i^{\text{high}}$ denote the lower and upper quantiles of that feature over the sample set. The normalization is then calculated as:

[0100] $$F_i^{\text{norm}} = \frac{F_i - Q_i^{\text{low}}}{Q_i^{\text{high}} - Q_i^{\text{low}}};$$

[0101] This maps the features to a range of 0 to 1;

[0102] Values exceeding the range are truncated:

[0103] $$F_i^{\text{norm}} \leftarrow \min\left(\max\left(F_i^{\text{norm}}, 0\right), 1\right);$$

[0104] where $F_i^{\text{norm}}$ represents the normalized feature value.

[0105] Construct an audio sparsity scoring function:

[0106] Perform inverse mapping for features that are sparser with smaller values: $\tilde{F}_i = 1 - F_i^{\text{norm}}$;

[0107] Then, the multidimensional features are fused:

[0108] $$S = \frac{1}{6}\sum_{i=1}^{6} \tilde{F}_i,$$

[0109] where $S$ represents the audio sparsity score, and $\tilde{F}_i = F_i^{\text{norm}}$ for features that need no inverse mapping. Based on a score threshold $\theta$, the samples are divided into sparse and dense samples, and corresponding sparsity masks are generated.

[0110] Specifically, after extracting the six-dimensional features of all samples, the scales of the features must be unified so that differences in their units do not affect subsequent calculations. In this embodiment, the system first analyzes the distribution of all training samples in each feature dimension and calculates the 10th and 90th quantiles for each dimension. A normalization method based on quantile mapping then scales the features to the range of 0 to 1, and values exceeding the range are truncated to lie between 0 and 1. This method is more robust to outliers than traditional min-max normalization, which ensures the stability of the subsequent feature fusion process. The system next performs directional unification on the normalized six-dimensional features. For features where smaller values indicate sparser data, a reverse mapping is applied, so that all features share the numerical sense "larger values indicate denser audio". The six features are then fused with equal weights to calculate the sparsity score of the audio sample. The system sets a threshold based on the statistical distribution of the scores in the training data; in this embodiment, the 60th quantile is selected as the decision threshold. When a sample score is below the threshold, the sample is classified as a sparse audio sample and marked as 0; when the score is greater than or equal to the threshold, it is classified as a dense audio sample and marked as 1. The resulting sparsity mask guides the subsequent differentiated feature enhancement processing.
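Choosing the threshold from the training-score distribution can be sketched as below. The names `fit_threshold` and `classify` are hypothetical, and the linear-interpolation quantile rule is one common convention rather than something the text specifies.

```python
def quantile(values, q):
    """Quantile of a list with linear interpolation, q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def fit_threshold(train_scores, q=0.60):
    """Decision threshold: the q-th quantile of the training sparsity scores."""
    return quantile(train_scores, q)

def classify(score, threshold):
    """Return 0 for a sparse sample (score below threshold), 1 for a dense one."""
    return 1 if score >= threshold else 0

# Usage with made-up training scores.
train_scores = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3]
threshold = fit_threshold(train_scores)
labels = [classify(s, threshold) for s in (0.45, 0.80)]
```

One consequence of a quantile-based threshold is that roughly 60% of the training samples fall below it and are routed to the sparse branch, regardless of the absolute scale of the scores.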

[0111] Optionally, in step S3, the visual features of the corresponding video frame are weighted by attention based on the audio features of the dense audio sample to obtain enhanced dense visual features, while retaining the original audio features of the dense audio sample.

[0112] Specifically, the system processes samples into dense and sparse branches based on the aforementioned sparsity mask. For audio samples deemed dense, since their audio information is continuous and contains valid event signals, the system employs an audio-guided visual attention mechanism to enhance visual features. In this process, video frames are first encoded using a visual feature extraction network. In this embodiment, the visual encoding network uses a ResNet-50 structure, with an input video frame size of 224×224 pixels, and extracts high-dimensional visual features for each frame. Simultaneously, the corresponding audio feature representation is extracted using an audio encoding network; in this embodiment, the audio encoder uses a VGGish network structure. Subsequently, the system uses the audio features as a guiding signal to calculate the attention weights of the visual features, enabling the model to focus more on visual regions related to audio events, thereby enhancing the expressive power of visual features. For audio samples deemed sparse, since their audio sequences contain a large number of silent or invalid frames, to avoid erroneous attention guidance causing visual semantic shifts, the system only extracts the original visual projection features output by the visual encoding network in this branch, without performing audio-guided attention enhancement.
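The branch routing described above amounts to a simple dispatch on the sparsity mask. In the sketch below, `route_sample` is a hypothetical name, and `attend` and `repair` are placeholder callables standing in for the audio-guided attention module and the silent frame repair module.

```python
def route_sample(mask_value, audio_feat, visual_feat, attend, repair):
    """Return the (audio, visual) feature pair produced by the matching branch."""
    if mask_value == 1:
        # Dense branch: keep the original audio, enhance the visual features.
        return audio_feat, attend(audio_feat, visual_feat)
    # Sparse branch: repair the audio, keep the original visual features.
    return repair(audio_feat), visual_feat

# Usage with toy stand-ins for the two modules.
toy_attend = lambda audio, visual: [2 * v for v in visual]
toy_repair = lambda audio: [1.0 if a == 0 else a for a in audio]
dense_out = route_sample(1, [0.0, 3.0], [1.0], toy_attend, toy_repair)
sparse_out = route_sample(0, [0.0, 3.0], [1.0], toy_attend, toy_repair)
```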

[0113] Optionally, in step S4, the audio features of the sparse audio samples are subjected to silent frame detection and repair processing to obtain the repaired sparse audio features, and the original visual features of the video frames corresponding to the sparse audio samples are extracted.

[0114] The sparse audio samples contain silent frames in their audio sequences. The original visual features of the video frames in the sparse audio samples are extracted using a visual coding network.

[0115] For the detected silent frames, the audio sequence is first analyzed using a period search algorithm based on the autocorrelation function. The period of the audio signal is estimated by finding the peak position in the autocorrelation function, and the period information is used to fill in the missing frames.

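[0116] Let the audio signal be $x(n)$, $n = 1, 2, \dots, N$. Its autocorrelation function is:

[0117] $R(\tau) = \sum_{n=1}^{N-\tau} x(n)\,x(n+\tau)$,

[0118] where $R(\tau)$ denotes the autocorrelation value at delay $\tau$;

[0119] When a stable period $P$ is detected, periodic completion is performed:

[0120] $\hat{x}(n) = x(n - P)$,

[0121] When there is no obvious periodic structure in the audio signal, linear interpolation is used to complete the silent frame from the adjacent valid frames:

[0122] $\hat{x}_t = x_{t_1} + \dfrac{t - t_1}{t_2 - t_1}\,\bigl(x_{t_2} - x_{t_1}\bigr)$, where $t_1 < t < t_2$ and $t_1$, $t_2$ are the nearest valid frames before and after the silent frame $t$,

[0123] Let the original audio features be $A_{orig}$, the repaired audio features be $A_{rep}$, and the fusion weight be $\alpha$. The enhanced audio features are: $A_{enh} = \alpha A_{orig} + (1 - \alpha) A_{rep}$.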

[0124] When multiple consecutive frames are silent and cannot be recovered by interpolation, the average value of the features of the non-silent frames is used to fill in the gaps, resulting in the repaired sparse audio features.

[0125] After restoration, dynamic weighting coefficients are introduced to perform weighted fusion of the original audio features and the restored audio features, resulting in the restored sparse audio features:

[0126] Let the original audio features be $A_{orig}$, the repaired audio features be $A_{rep}$, and the fusion weight be $\alpha$. The enhanced audio features are:

[0127] $A_{enh} = \alpha A_{orig} + (1 - \alpha) A_{rep}$.

[0128] Specifically, in the sparse branch, the system performs repair processing on the audio sequence to address the semantic gaps caused by silent frames. First, silent frames are identified by calculating the short-time energy of each frame; a frame is considered silent if its energy is below a preset threshold. For detected silent frames, the system prioritizes using a periodic search algorithm based on the autocorrelation function to analyze the audio sequence. The signal period is estimated by finding the peak position in the autocorrelation function, and the periodic information is used to fill in missing frames. When there is no obvious periodic structure in the audio signal, linear interpolation is used to fill in the silent frames using adjacent valid frames to ensure the temporal continuity of the audio feature sequence. If multiple consecutive frames are silent and cannot be recovered by interpolation, the average value of the features from non-silent frames is used for further filling. After repair, the system introduces dynamic weighting coefficients to weight and fuse the original and repaired audio features, thereby improving the integrity of the audio sequence while maintaining the authenticity of the original signal.
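The repair order described above (periodic completion first, then linear interpolation, then mean filling, followed by weighted fusion) can be sketched as below. This is an illustrative sketch only: it works on frame-level features and estimates the period from the short-time energy sequence, and `energy_thresh`, `max_gap`, `min_strength` and the fixed weight `alpha` are hypothetical parameters (the text describes the fusion weight as dynamic).

```python
import numpy as np

def estimate_period(signal, min_strength=0.5):
    """Lag of the strongest autocorrelation peak, or None if no clear period."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:]  # R(0), R(1), ..., R(n-1)
    if r[0] <= 0:
        return None
    r = r / r[0]
    peaks = [k for k in range(1, n // 2) if r[k] > r[k - 1] and r[k] >= r[k + 1]]
    if not peaks:
        return None
    best = max(peaks, key=lambda k: r[k])
    return best if r[best] >= min_strength else None

def repair_silent_frames(feats, energy, energy_thresh, max_gap=3, alpha=0.7):
    """feats: (T, D) audio features; energy: (T,) short-time energy per frame."""
    silent = energy < energy_thresh
    valid = np.flatnonzero(~silent)
    repaired = feats.astype(float)
    if len(valid) == 0 or not silent.any():
        return repaired, silent
    period = estimate_period(energy)
    mean_valid = feats[valid].mean(axis=0)
    for t in np.flatnonzero(silent):
        # 1) Periodic completion: copy a valid frame one period away.
        donors = [] if period is None else [t - period, t + period]
        donors = [d for d in donors if 0 <= d < len(feats) and not silent[d]]
        if donors:
            repaired[t] = feats[donors[0]]
            continue
        # 2) Linear interpolation between the nearest valid neighbours.
        left, right = valid[valid < t], valid[valid > t]
        if len(left) and len(right) and right[0] - left[-1] - 1 <= max_gap:
            lo, hi = left[-1], right[0]
            w = (t - lo) / (hi - lo)
            repaired[t] = (1 - w) * feats[lo] + w * feats[hi]
        else:
            # 3) Long silent run or sequence edge: mean of non-silent frames.
            repaired[t] = mean_valid
    # Weighted fusion of original and repaired features.
    return alpha * feats + (1 - alpha) * repaired, silent
```

For non-silent frames the repaired and original features coincide, so the fusion leaves them unchanged; only silent frames are affected by `alpha`.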

[0129] Optionally, in step S5, the enhanced dense visual features and the corresponding original audio features, as well as the original visual features and the repaired sparse audio features, are respectively input into a bidirectional long short-term memory network for single-modal temporal modeling, and bidirectional feature fusion is performed through a cross-modal interaction module to obtain multimodal fused features.

[0130] The Bi-LSTM bidirectional long short-term memory network consists of two layers, each with 512 hidden units, and is used to capture contextual dependencies over a long period of time to complete single-modal temporal modeling.

[0131] The enhanced dense visual features and their corresponding original audio features, as well as the original visual features and their repaired sparse audio features, are respectively input into a bidirectional long short-term memory network for single-modal temporal modeling.

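[0132] $H_a = \mathrm{BiLSTM}(F_a)$, $H_v = \mathrm{BiLSTM}(F_v)$;

[0133] where $\mathrm{BiLSTM}(\cdot)$ represents the encoding function of the temporal neural network, $F_a$ and $F_v$ are the audio and visual feature sequences input to it, $H_a$ represents the audio temporal features, and $H_v$ represents the visual temporal features;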

[0134] Audio and visual features are input into the cross-modal interaction module for deep fusion;

[0135] A multi-layer cross-attention mechanism is used to achieve bidirectional information interaction between audio and vision, and between vision and audio. Residual connections and layer normalization operations are introduced in each layer to obtain multimodal fusion features.

[0136] Specifically, the system inputs the audio and visual feature sequences into a bidirectional long short-term memory (Bi-LSTM) network for temporal modeling. In this embodiment, the Bi-LSTM network has two layers with 512 hidden units per layer and is used to capture contextual dependencies over a long time span. After unimodal encoding is complete, the system feeds the audio and visual features into a cross-modal interaction module for deep fusion. This module achieves bidirectional information interaction, from audio to vision and from vision to audio, through a multi-layer cross-attention mechanism. Residual connections and layer normalization are introduced in each layer to improve the stability of model training.
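A minimal sketch of the cross-modal interaction step is given below. It assumes the Bi-LSTM outputs are already available as arrays of equal feature dimension, uses plain scaled dot-product attention without learned projections, and concatenates the two streams at the end; all three choices are simplifying assumptions rather than details taken from the embodiment.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attend(query, context):
    """Scaled dot-product attention of query frames over context frames."""
    d = query.shape[-1]
    attn = softmax(query @ context.T / np.sqrt(d))  # (Tq, Tc)
    return attn @ context

def cross_modal_layer(h_a, h_v):
    """One layer: each modality attends to the other, with residual + layer norm."""
    new_a = layer_norm(h_a + cross_attend(h_a, h_v))  # audio attends to vision
    new_v = layer_norm(h_v + cross_attend(h_v, h_a))  # vision attends to audio
    return new_a, new_v

def cross_modal_interaction(h_a, h_v, num_layers=2):
    """h_a, h_v: (T, D) temporal features; returns (T, 2D) fused features."""
    for _ in range(num_layers):
        h_a, h_v = cross_modal_layer(h_a, h_v)
    return np.concatenate([h_a, h_v], axis=-1)
```

Both directions within a layer read the features from the previous layer, so neither modality sees the other's update early.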

[0137] Optionally, in step S6, mapping the multimodal fusion features to a pre-trained joint semantic space to obtain the semantic similarity between audio and visual features, and reweighting the features based on the semantic similarity to obtain semantically aligned features, the specific method includes:

[0138] The audio features in the multimodal fusion features are averaged and then projected onto the shared semantic space of the CLIP model through a linear mapping layer to obtain the global audio embedding.

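[0139] The audio features are average-pooled over time: $\bar{h}_a = \frac{1}{T}\sum_{t=1}^{T} h_a^{(t)}$, where $h_a^{(t)}$ is the audio feature of frame $t$ in the multimodal fusion features and $T$ is the number of frames;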

[0140] The visual features in the multimodal fusion features are projected onto the shared semantic space of the CLIP model through an adaptation layer to obtain a visual frame-level embedding.

[0141] Calculate the cosine similarity between the global audio embedding and the visual frame-level embedding for each frame:

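[0142] $s_t = \dfrac{e_a \cdot e_v^{(t)}}{\lVert e_a \rVert \, \lVert e_v^{(t)} \rVert}$

[0143] where $\lVert \cdot \rVert$ represents the norm of a vector, $e_a$ is the global audio embedding, and $e_v^{(t)}$ is the visual frame-level embedding of frame $t$;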

[0144] The cosine similarity is normalized into semantic matching weights using the Softmax function;

[0145] The audio and visual features in the multimodal fusion features are weighted and fused using the semantic matching weights to obtain semantically aligned features.

[0146] Specifically, the system first performs average pooling on the audio temporal features to obtain a global audio semantic representation, and then projects it onto the CLIP shared semantic space through a linear mapping. Simultaneously, the system maps visual frame-level features to the same semantic space through an adaptation layer. Subsequently, the cosine similarity between audio and visual features is calculated and normalized using a softmax function to obtain semantic matching weights. The system uses these weights to reweight the audio and visual features, giving higher weight to modal information with higher semantic consistency during the fusion process, thereby improving cross-modal semantic alignment capabilities.
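The reweighting described above can be sketched as follows. The projection matrices `W_a` and `W_v` stand in for the linear mapping layer and the adaptation layer and are hypothetical, and applying the weights to the concatenated per-frame features is one possible reading of "weighted and fused", not a detail confirmed by the text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def semantic_reweight(h_a, h_v, W_a, W_v):
    """h_a: (T, Da) audio features; h_v: (T, Dv) visual features.

    W_a: (Da, E) and W_v: (Dv, E) project both modalities into a shared
    E-dimensional space (standing in for the CLIP joint space).
    """
    e_a = h_a.mean(axis=0) @ W_a  # global audio embedding, shape (E,)
    e_v = h_v @ W_v               # frame-level visual embeddings, shape (T, E)
    # Cosine similarity between the audio embedding and each visual frame.
    cos = (e_v @ e_a) / (np.linalg.norm(e_v, axis=1) * np.linalg.norm(e_a) + 1e-12)
    weights = softmax(cos)        # semantic matching weights over frames
    aligned = np.concatenate([h_a, h_v], axis=1) * weights[:, None]
    return aligned, cos, weights
```

Frames whose visual embedding agrees most with the global audio embedding receive the largest weight, which matches the stated intent of favoring semantically consistent information.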

[0147] During model training, the system constructs a multi-task joint loss function to optimize the model. This loss function includes a classification loss to measure the accuracy of category prediction, an audiovisual localization assistance loss to constrain the temporal synchronization of audiovisual events, and a CLIP similarity supervision loss to enhance semantic consistency. Furthermore, a teacher model is introduced to supervise knowledge distillation of visual features, improving the feature representation ability of the student model through distillation loss. The system jointly optimizes the above loss functions using a backpropagation algorithm, thereby comprehensively improving the model's classification ability, localization accuracy, and cross-modal semantic consistency.
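As a rough illustration of how the four loss terms might be combined, the sketch below uses a weighted sum. The weighted-sum form, the weights, and the mean-squared-error form of the distillation term are assumptions; the text only states that the losses are jointly optimized. The localization and CLIP similarity terms are passed in as precomputed values because their exact definitions are not given.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy; logits: (T, C), labels: (T,) integer class ids."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def distillation_loss(student_feats, teacher_feats):
    """Assumed MSE between student and teacher visual features."""
    return np.mean((student_feats - teacher_feats) ** 2)

def joint_loss(l_cls, l_avl, l_clip, l_kd, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of classification, localization, CLIP and distillation terms."""
    terms = (l_cls, l_avl, l_clip, l_kd)
    return sum(w * t for w, t in zip(weights, terms))
```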

[0148] Optionally, the specific method for inputting the semantically aligned features into the classifier in step S7 and outputting the category of the audiovisual event and its temporal boundary includes:

[0149] The system inputs the fused multimodal features into the classification and prediction module to generate event recognition results. In fully supervised training mode, the system uses a two-layer fully connected network to directly output the event category probability for each frame, thereby achieving frame-level audiovisual event localization and classification. In weakly supervised mode, the system introduces dynamic weighted branches to model the saliency of different categories and outputs video-level event classification results and corresponding temporal localization intervals.
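In the fully supervised mode, per-frame category predictions still have to be turned into temporal boundaries. One simple way to do this, shown here as an assumption rather than as the method of the embodiment, is to merge consecutive segments that share a label; the `background` label name is hypothetical.

```python
def segments_to_events(labels, background="background"):
    """Merge runs of identical per-segment labels into (label, start, end) tuples.

    start is inclusive and end is exclusive, both in segment indices.
    """
    events = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != background:
                events.append((labels[start], start, t))
            start = t
    return events
```

With one-second segments, the returned indices can be read directly as start and end times in seconds.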

[0150] Figure 2 is a schematic diagram of the structure of an audiovisual event localization system based on audio sparsity perception and multidimensional semantic enhancement, provided as an embodiment of the present invention. As shown in Figure 2, the system includes an audio feature extraction module 201, a feature normalization module 202, a sparsity scoring and classification module 203, a dual-branch differential enhancement module 204, an intelligent audio restoration module 205, a temporal modeling and cross-modal interaction module 206, a semantic consistency modeling module 207, a multi-task optimization module 208, and an event prediction output module 209. A sparsity analysis module is also provided at the audio feature extraction module 201. The above modules are interconnected through data interfaces and are executed sequentially according to the process shown in Figure 1, forming a complete audiovisual event localization system.

[0151] Example

[0152] This invention proposes a method for audiovisual event localization based on audio sparsity perception and semantic guidance. By performing refined sparsity classification of audio and video signals, dual-branch differential enhancement, and semantic consistency modeling based on CLIP pre-training space, this invention demonstrates significant superiority in localization accuracy and robustness in complex audiovisual scenarios.

[0153] To verify the effectiveness of this invention, the publicly available standard dataset AVE was used for evaluation. This dataset contains 4,143 real-world video samples covering 28 event categories, including human activities, animal behavior, musical instrument performances, and traffic scenes. Each video is evenly divided into 10 one-second segments, with segment-level temporal boundary annotations and video-level category labels. Following the official split, 3,339 videos were used for training, and the remaining samples were evenly divided into validation and test sets. The experiment used overall localization accuracy as the evaluation metric and constructed a time-aligned audiovisual multimodal representation. For feature extraction, a global-local hybrid coding framework is adopted on the visual side: global semantics are extracted by the CLIP ViT-L/14 encoder proposed by OpenAI (with frozen parameters), while local region features are obtained from the Top-5 RoIs of a Faster R-CNN (ResNet-50-FPN) pre-trained on COCO, with their relationships modeled by a multi-head GAT. The two are finally fused into a 10×512 video-level visual representation. On the audio side, the audio is uniformly resampled to 32 kHz, and a 2048-dimensional embedding is extracted by the pre-trained PANNs CNN14 to form a 10×2048 time-series feature, which provides the foundation for subsequent sparsity perception and differential enhancement.
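The split sizes and feature shapes implied by this setup can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
total_videos = 4143
train_videos = 3339

# The remaining videos are divided evenly into validation and test sets.
remaining = total_videos - train_videos
val_videos = test_videos = remaining // 2

segments_per_video = 10            # one-second segments
visual_shape = (segments_per_video, 512)   # fused visual representation
audio_shape = (segments_per_video, 2048)   # PANNs CNN14 embeddings
```

This gives 804 held-out videos, that is, 402 each for validation and testing.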

[0154] Under the aforementioned unified testing conditions, this invention achieves the current best result on the standard test set:

[0155] First, this invention overcomes the performance bottleneck of existing technologies in localization accuracy under both fully supervised and weakly supervised conditions. Experimental data shows that on the AVE dataset the fully supervised accuracy reaches 84.6% and the weakly supervised accuracy reaches 81.3%. Compared to AVSGN (CVPR 2025), which performs well among existing technologies, this invention improves accuracy by 1.5 and 2.5 percentage points on the two metrics, respectively; compared to the state-of-the-art ESIP model from 2023, the improvements are larger still, reaching 3.2 and 3.9 percentage points, respectively. This demonstrates that the sparse perception mechanism proposed in this invention captures event boundaries more effectively and that, in weakly supervised settings with scarce label information, intelligent feature completion and semantic guidance give it stronger perception capability and localization accuracy than existing technologies.

[0156] Second, the cross-modal semantic alignment capability and model generalization performance are superior to existing solutions based on large-scale backbone networks. Existing technologies such as PROMIA (ICML 2025), although they employ a very large Swin-V2-L backbone network, reach a fully supervised accuracy of only 79.3% because they lack deep modeling of the heterogeneity of the signal distribution. In contrast, this invention combines multi-scale features of CLIP ViT-L/14 and ResNet-50 on the visual side and, through a dual-layer cross-attention and semantic reweighting mechanism, leads PROMIA by 5.3 percentage points on the fully supervised metric. This demonstrates that this invention not only effectively bridges the semantic gap between the audio and video modalities but also significantly outperforms existing solutions that rely solely on increasing the number of parameters, both in feature utilization efficiency and in adaptability to complex scenarios, and therefore has extremely high practical application value.

[0157] Table 1 Performance Comparison of Audiovisual Event Localization Models (AVE Dataset)

[0158]

[0159] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for locating audiovisual events based on audio sparsity perception and semantic guidance, characterized in that the localization method includes the following steps:
Step S1: After performing frame-by-frame processing on the audio signal of the input video, extract multidimensional statistical features describing the temporal distribution characteristics of the audio signal;
Step S2: Construct an audio sparsity score based on the multidimensional statistical features, and classify the audio samples into sparse audio samples or dense audio samples according to the audio sparsity score;
Step S3: Guided by the audio features of the dense audio samples, apply attention weighting to the visual features of the corresponding video frames to obtain enhanced dense visual features, while retaining the original audio features of the dense audio samples;
Step S4: Perform silent frame detection and repair processing on the audio features of the sparse audio samples to obtain the repaired sparse audio features, and extract the original visual features of the video frames corresponding to the sparse audio samples;
Step S5: Input the enhanced dense visual features and the corresponding original audio features, as well as the original visual features and the repaired sparse audio features, into a bidirectional long short-term memory network for single-modal temporal modeling, and perform bidirectional feature fusion through a cross-modal interaction module to obtain multimodal fused features;
Step S6: Map the multimodal fusion features to the pre-trained joint semantic space to obtain the semantic similarity between audio features and visual features, and reweight the features based on the semantic similarity to obtain semantically aligned features;
Step S7: Input the semantically aligned features into the classifier and output the category of the audiovisual event and its temporal boundary.

2. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that the specific method for extracting multidimensional statistical features describing the temporal distribution characteristics of the audio signal after performing frame segmentation processing on the audio signal of the input video in step S1 includes: the audio signal is extracted from the input video file and converted to WAV format, the audio sampling rate is set to 16 kHz, and a mono signal is used; a sliding window is then used to split the audio signal into frames, with the frame length set to 25 ms and the frame shift set to 10 ms, and a Hamming window function is applied to each frame; let the audio signal of the t-th frame be $x_t(n)$, where $n = 1, 2, \dots, N$ represents the intra-frame sampling point index and $N$ represents the number of sampling points per frame; then the short-time energy of the t-th frame is: $E_t = \sum_{n=1}^{N} x_t^2(n)$, where $E_t$ represents the short-time energy of frame t and $x_t(n)$ represents the audio amplitude of the n-th sample point in the t-th frame; six statistical features are extracted from the audio frame sequence to describe its temporal distribution characteristics: energy coverage, active segment interval, energy change rate, peak density, RMS coefficient of variation, and zero-crossing rate standard deviation; these features describe the sparse and dense distribution characteristics of the audio signal in the time dimension.

3. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that, before constructing the audio sparsity score based on the multidimensional statistical features in step S2, cross-sample normalization is first performed on the six statistical features: let $F_i$ denote the original feature value of the i-th feature dimension, $Q_{10}(F_i)$ denote the 10th quantile of that feature in the sample set, and $Q_{90}(F_i)$ denote the 90th quantile of that feature in the sample set; normalization is then calculated as: $F_i^{norm} = \dfrac{F_i - Q_{10}(F_i)}{Q_{90}(F_i) - Q_{10}(F_i)}$; this maps the features to a range of 0 to 1; values exceeding the range are truncated: $F_i^{norm} = \min\bigl(\max\bigl(F_i^{norm}, 0\bigr), 1\bigr)$; where $F_i^{norm}$ represents the normalized feature value.

4. A semantically guided audiovisual event localization method, characterized in that the specific method for constructing an audio sparsity score based on the multidimensional statistical features in step S2, and classifying audio samples into sparse or dense audio samples according to the audio sparsity score, includes: construct an audio sparsity scoring function: perform inverse mapping for features whose larger values indicate sparser audio: $\tilde{F}_i = 1 - F_i^{norm}$, with $\tilde{F}_i = F_i^{norm}$ for the remaining features; then, the multidimensional features are fused and calculated: $S = \frac{1}{6}\sum_{i=1}^{6} \tilde{F}_i$, where $S$ represents the audio sparsity score; based on a scoring threshold $\theta$, the samples are divided into sparse and dense samples, and corresponding sparsity masks are generated.

5. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that the specific method in step S3 for applying attention weighting to the visual features of the corresponding video frames, guided by the audio features of the dense audio samples, to obtain enhanced dense visual features while retaining the original audio features of the dense audio samples, includes: the dense audio samples contain continuous audio information and valid event signals, so an audio-guided visual attention mechanism is used to enhance the visual features; first, the video frames are encoded using a visual feature extraction network; the visual coding network adopts the ResNet-50 structure, with an input video frame size of 224×224 pixels, and extracts high-dimensional visual features for each frame; the corresponding audio feature representation is extracted through the audio coding network, and the audio encoder adopts the VGGish network structure; the attention weights of the visual features are then calculated using the audio features as guiding signals to obtain the enhanced dense visual features.

6. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that the specific method for performing silent frame detection and repair processing on the audio features of the sparse audio samples in step S4 to obtain the repaired sparse audio features, and extracting the original visual features of the video frames corresponding to the sparse audio samples, includes: the sparse audio samples contain silent frames in their audio sequences; the original visual features of the video frames in the sparse audio samples are extracted using a visual coding network; for the detected silent frames, the audio sequence is first analyzed using a period search algorithm based on the autocorrelation function, the period of the audio signal is estimated by finding the peak position in the autocorrelation function, and the period information is used to fill in the missing frames; let the audio signal be $x(n)$, $n = 1, 2, \dots, N$; its autocorrelation function is: $R(\tau) = \sum_{n=1}^{N-\tau} x(n)\,x(n+\tau)$, where $R(\tau)$ denotes the autocorrelation value at delay $\tau$; when a stable period $P$ is detected, periodic completion is performed: $\hat{x}(n) = x(n - P)$; when there is no obvious periodic structure in the audio signal, linear interpolation is used to complete the silent frame from the adjacent valid frames: $\hat{x}_t = x_{t_1} + \dfrac{t - t_1}{t_2 - t_1}\,\bigl(x_{t_2} - x_{t_1}\bigr)$, where $t_1 < t < t_2$ and $t_1$, $t_2$ are the nearest valid frames before and after the silent frame $t$; let the original audio features be $A_{orig}$, the repaired audio features be $A_{rep}$, and the fusion weight be $\alpha$; the enhanced audio features are: $A_{enh} = \alpha A_{orig} + (1 - \alpha) A_{rep}$; when multiple consecutive frames are silent and cannot be recovered by interpolation, the average value of the features of the non-silent frames is used to fill in the gaps, resulting in the repaired sparse audio features; after the repair is completed, dynamic weight coefficients are introduced to perform weighted fusion of the original audio features and the repaired audio features to obtain the repaired sparse audio features.

7. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that the specific method for inputting the enhanced dense visual features and the corresponding original audio features, as well as the original visual features and the repaired sparse audio features, into a bidirectional long short-term memory network for single-modal temporal modeling in step S5, and then performing bidirectional feature fusion through a cross-modal interaction module to obtain multimodal fused features, includes: the Bi-LSTM bidirectional long short-term memory network consists of two layers, each with 512 hidden units, and is used to capture contextual dependencies over a long period of time to complete single-modal temporal modeling; the enhanced dense visual features and their corresponding original audio features, as well as the original visual features and their repaired sparse audio features, are respectively input into the bidirectional long short-term memory network for single-modal temporal modeling: $H_a = \mathrm{BiLSTM}(F_a)$, $H_v = \mathrm{BiLSTM}(F_v)$; where $\mathrm{BiLSTM}(\cdot)$ represents the encoding function of the temporal neural network, $F_a$ and $F_v$ are the audio and visual feature sequences input to it, $H_a$ represents the audio temporal features, and $H_v$ represents the visual temporal features; the audio and visual features are input into the cross-modal interaction module for deep fusion; a multi-layer cross-attention mechanism is used to achieve bidirectional information interaction, from audio to vision and from vision to audio; residual connections and layer normalization operations are introduced in each layer to obtain the multimodal fusion features.

8. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that the specific method in step S6 for mapping the multimodal fusion features to a pre-trained joint semantic space to obtain the semantic similarity between audio and visual features, and reweighting the features based on this semantic similarity to obtain semantically aligned features, includes: the audio features in the multimodal fusion features are averaged and then projected onto the shared semantic space of the CLIP model through a linear mapping layer to obtain the global audio embedding; the audio features are average-pooled over time: $\bar{h}_a = \frac{1}{T}\sum_{t=1}^{T} h_a^{(t)}$, where $h_a^{(t)}$ is the audio feature of frame $t$ and $T$ is the number of frames; the visual features in the multimodal fusion features are projected onto the shared semantic space of the CLIP model through an adaptation layer to obtain a visual frame-level embedding; the cosine similarity between the global audio embedding and the visual frame-level embedding is calculated for each frame: $s_t = \dfrac{e_a \cdot e_v^{(t)}}{\lVert e_a \rVert \, \lVert e_v^{(t)} \rVert}$, where $\lVert \cdot \rVert$ represents the norm of a vector, $e_a$ is the global audio embedding, and $e_v^{(t)}$ is the visual frame-level embedding of frame $t$; the cosine similarity is normalized into semantic matching weights using the Softmax function; the audio and visual features in the multimodal fusion features are weighted and fused using the semantic matching weights to obtain semantically aligned features.

9. The audiovisual event localization method based on audio sparsity perception and semantic guidance according to claim 1, characterized in that, The classifier mentioned in step S7 includes: In the fully supervised mode, a two-layer fully connected network is used to directly output the event category probability of each frame; In the weakly supervised mode, dynamic weighted branching is used to model the saliency of different categories, and the video-level event classification results and corresponding time positioning intervals are output.