Multimodal data processing method and device based on sound event detection model and visual detection model, equipment and medium

CN122201346A · Pending · Publication Date: 2026-06-12 · MALANSHAN AUDIO & VIDEO LABORATORY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
MALANSHAN AUDIO & VIDEO LABORATORY
Filing Date
2026-03-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing audio processing methods cannot effectively determine whether a sound corresponds to the content of the picture, which reduces the training value of the data.

Method used

A multimodal collaborative verification scheme combining a sound event detection model and a visual detection model is adopted. By determining the event occurrence probability of the target event in the audio data and the object existence probability of the target object in the video data, consistency verification and annotation are performed. The start and end of the event are established using a dual threshold hysteresis mechanism. Combined with temporal smoothing technology and bounding box information from the visual detection model, high-quality multimodal training data is constructed.

Benefits of technology

It improves the purity and alignment of training data, enhances the recognition accuracy and robustness of the model in complex acoustic environments, and strengthens the anti-interference ability and semantic alignment accuracy of the multimodal model in complex scenarios.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122201346A_ABST

Abstract

The application discloses a multimodal data processing method, apparatus, device, and medium based on a sound event detection model and a visual detection model, and relates to the field of data labeling. The method comprises: acquiring audio data from multimodal data, and determining the event occurrence probability of a target event in the audio data to determine a target probability curve; determining, based on the target probability curve and two probability thresholds, the target time period during which the target event lasts in the audio data, and determining the audio data corresponding to the target time period as target audio data; acquiring target video data corresponding to the target audio data from the multimodal data, and determining the object existence probability of a target object in the target video data; and performing a consistency check on the target audio data and the target video data, and labeling the target audio data and the target video data based on the check result. The application thereby provides an intelligent labeling scheme for dual-modal collaborative checking of audio and video.

Description

Technical Field

[0001] This invention relates to the field of data annotation, and in particular to a multimodal data processing method, apparatus, device, and medium based on sound event detection models and visual detection models.

Background Technology

[0002] In recent years, artificial intelligence technology for audio processing has developed rapidly, and various audio algorithms based on deep learning have become the focus of the field of artificial intelligence.

[0003] Currently, many audio models have been engineered and deployed, leading to large-scale demand for audio training data. In the field of multimedia data annotation, relying solely on audio features for automatic data cleaning often results in misjudgments. For example, a recording might contain the sound of a car engine while no car appears in the video, which reduces the training value of the data with respect to the video content. Existing automated processing methods based on a single audio modality suffer from fundamental flaws such as auditory illusions and semantic mismatches. While current sound event detection models can extract specific audio events through probability thresholds, they perform essentially signal-level recognition and cannot determine whether the sound corresponds to the video content.

[0004] Therefore, designing an intelligent annotation scheme for dual-modal collaborative verification of audio and video is a technical problem that urgently needs to be solved.

Summary of the Invention

[0005] In view of this, the purpose of this invention is to provide a multimodal data processing method, apparatus, device, and medium based on a sound event detection model and a visual detection model, which provide an intelligent annotation scheme for dual-modal collaborative verification of audio and video. The specific scheme is as follows:

Firstly, this application provides a multimodal data processing method based on a sound event detection model and a visual detection model, including:

obtaining audio data from multimodal data, determining the event occurrence probability of a target event in the audio data by a sound event detection model, and determining a corresponding target probability curve based on the event occurrence probability and a temporal smoothing technique;

determining, based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, the target time period in the audio data, and determining the audio data corresponding to the target time period as the target audio data; wherein the first preset probability threshold is a threshold used to determine whether the target event in the audio data has started; the second preset probability threshold is a threshold used to determine whether the target event in the audio data has ended; and the first preset probability threshold is greater than the second preset probability threshold;

obtaining target video data corresponding to the target audio data from the multimodal data, and determining the object existence probability of the target object in the target video data using a visual detection model;

performing consistency verification on the target audio data and the target video data based on the event occurrence probability and the object existence probability, labeling the target audio data and the target video data based on the verification results, and training the target model using the labeled target audio data and target video data.

[0006] Optionally, the step of determining the occurrence probability of a target event in the audio data through a sound event detection model, and determining a corresponding target probability curve based on the occurrence probability and temporal smoothing techniques, includes:

predicting, by the sound event detection model, the occurrence probability of target events in each frame of the audio data to obtain an initial prediction result;

determining, from the initial prediction result, the target prediction results whose confidence level meets the preset high confidence condition;

accumulating and regularizing, in the target prediction results of each frame of the audio data, the probabilities of events belonging to the same preset semantic category to obtain an initial probability curve;

performing a one-dimensional morphological closing operation on the initial probability curve using temporal smoothing techniques to obtain the target probability curve.

[0007] Optionally, determining the target time period in the audio data based on the target probability curve and using a first preset probability threshold and a second preset probability threshold includes:

determining the probability values in the target probability curve;

marking, in the audio data, a moment at which the probability value is greater than the first preset probability threshold as the start time of the target event;

marking, in the audio data later than the start time, the moment at which the probability value is less than the second preset probability threshold as the end time of the target event;

determining the target time period in the audio data based on the start time and the end time; wherein, during the target time period, the probability value corresponding to each time point other than the end time is greater than the second preset probability threshold.

[0008] Optionally, after determining the audio data corresponding to the target time period in the audio data as the target audio data, the method further includes:

determining, in the audio data, the moment earlier than the start time by a preset duration as the first moment, and the moment later than the end time by the preset duration as the second moment;

defining the time interval between the first moment and the start time, and the time interval between the end time and the second moment, as buffer time intervals, and determining the buffered audio data corresponding to the buffer time intervals in the audio data;

applying preset audio enhancement processing or preset audio attenuation processing to the audio data corresponding to the preset time period in the buffered audio data to obtain and save the corresponding processed audio data.

[0009] Optionally, the step of obtaining target video data corresponding to the target audio data from the multimodal data, and determining the probability of the existence of the target object in the target video data through a visual detection model, includes:

extracting video frames corresponding to the target time period from the multimodal data at a preset frequency to obtain the target video data corresponding to the target audio data;

determining the set of object categories in the target video data by the visual detection model, and calculating the probability of the existence of the target object in the target video data based on the set of object categories.

[0010] Optionally, the step of labeling the target audio data and the target video data based on the obtained verification results, and training the target model using the labeled target audio data and the target video data, includes:

if the verification result indicates that the probability of the object's existence is greater than a third preset probability threshold, and there is a preset correlation between the target event in the target audio data and the target object in the target video data, performing a preset slicing operation on the multimodal data based on the target audio data and the target video data to obtain first sliced data; labeling the first sliced data based on the target event and the target object to obtain first labeled data, and using the first labeled data to train the audio generation model and video generation model based on the latent diffusion model;

if the verification result shows that the target audio data meets the preset high confidence condition, and there is no preset correlation between the target event in the target audio data and the target object in the target video data, performing the preset slicing operation on the multimodal data based on the target audio data and the target video data to obtain second sliced data; labeling the second sliced data with preset mismatch labels to obtain second labeled data, or converting the sliced data into structured data, and using the second labeled data or the structured data to train a multimodal pre-trained model based on contrastive learning;

if the verification result shows that the target audio data meets the preset data integrity condition and the preset high signal-to-noise ratio condition, the target video data meets the preset visual blur condition, and there is no preset correlation between the target event in the target audio data and the target object in the target video data, annotating the target audio data based on the target event to obtain annotated audio data; saving the annotated audio data to a preset audio dataset, and using the preset audio dataset to fine-tune and optimize the sound event detection model.

[0011] Optionally, the multimodal data processing method based on the sound event detection model and the visual detection model further includes:

determining the signal-to-interference ratio (SIR) of the target audio data based on the event occurrence probability and a preset constant value, and identifying audio data in the target audio data whose SIR is greater than a preset purity threshold as pure audio segments;

determining the bounding box information generated by the visual detection model in processing the target video data, and determining the fill rate of the target object based on the bounding box information;

identifying, in the target video data, video data with a fill rate greater than a preset fill rate threshold as target video segments;

constructing a supervised learning dataset based on the pure audio segments and the target video segments using linear mixing and data augmentation techniques, in order to train the audio separation model using the supervised learning dataset.

[0012] Secondly, this application provides a multimodal data processing device based on a sound event detection model and a visual detection model, comprising:

a curve determination module, used to acquire audio data from multimodal data, determine the occurrence probability of a target event in the audio data through a sound event detection model, and determine the corresponding target probability curve based on the occurrence probability of the event and a temporal smoothing technique;

a data determination module, used to determine the target time period in the audio data based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, and to determine the audio data corresponding to the target time period in the audio data as target audio data; wherein the first preset probability threshold is a threshold used to determine whether the target event in the audio data has started; the second preset probability threshold is a threshold used to determine whether the target event in the audio data has ended; and the first preset probability threshold is greater than the second preset probability threshold;

a probability determination module, used to obtain target video data corresponding to the target audio data from the multimodal data, and determine the probability of the existence of the target object in the target video data through a visual detection model;

a data processing module, used to perform consistency verification on the target audio data and the target video data based on the probability of the event occurring and the probability of the object existing, to annotate the target audio data and the target video data based on the obtained verification results, and to train the target model using the annotated target audio data and the target video data.

[0013] Thirdly, this application provides an electronic device, comprising: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the aforementioned multimodal data processing method based on a sound event detection model and a visual detection model.

[0014] Fourthly, this application provides a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned multimodal data processing method based on a sound event detection model and a visual detection model.

[0015] In this application, audio data is obtained from multimodal data, and the occurrence probability of a target event in the audio data is determined by a sound event detection model. A corresponding target probability curve is determined based on the event occurrence probability and a temporal smoothing technique. Based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, a target time period for the duration of the target event in the audio data is determined, and the audio data corresponding to the target time period is identified as the target audio data. The first preset probability threshold is a threshold used to determine whether the target event in the audio data has started; the second preset probability threshold is a threshold used to determine whether the target event in the audio data has ended; the first preset probability threshold is greater than the second preset probability threshold. Target video data corresponding to the target audio data is obtained from the multimodal data, and the object existence probability of the target object in the target video data is determined by a visual detection model. Consistency checks are performed on the target audio data and the target video data based on the event occurrence probability and the object existence probability. The target audio data and the target video data are labeled based on the obtained check results, and the labeled target audio data and the target video data are used to train a target model. As can be seen from the above, in this application, audio data is extracted from multimodal data, and the occurrence probability of the target event in the audio data is determined using a sound event detection model. A corresponding target probability curve is then determined based on the event occurrence probability and temporal smoothing techniques. 
According to the target probability curve, a first preset probability threshold and a second preset probability threshold are used to determine the duration of the target event in the audio data, and the portion of the audio data corresponding to the target duration is identified as the target audio data. Target video data corresponding to the target audio data is extracted from the multimodal data, and the object existence probability of the target object in the target video data is determined using a visual detection model. Consistency verification is performed on the target audio data and target video data by combining the event occurrence probability and the object existence probability. The target audio data and target video data are labeled based on the verification results, and the labeled target audio data and target video data are used to train the target model. In this way, this application can effectively segment only when auditory events and visual objects coexist in both time and semantics, greatly improving the purity and alignment of the training data.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0017] Figure 1 is a flowchart of a multimodal data processing method based on a sound event detection model and a visual detection model disclosed in this application;

Figure 2 is a schematic diagram of the structure of a multimodal data processing device based on a sound event detection model and a visual detection model disclosed in this application;

Figure 3 is a structural diagram of an electronic device disclosed in this application.

Detailed Implementation

[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] Currently, many audio models have been engineered and deployed, leading to large-scale demand for audio training data. In the field of multimedia data annotation, relying solely on audio features for automatic data cleaning often results in misjudgments. For example, a recording might contain the sound of a car engine while no car appears in the video, which reduces the training value of the data with respect to the video content. Existing automated processing methods based on a single audio modality suffer from fundamental flaws such as auditory illusions and semantic mismatches. While current sound event detection models can extract specific audio events through probability thresholds, they perform essentially signal-level recognition and cannot determine whether the sound corresponds to the video content. To address this, this application provides a multimodal data processing method, apparatus, device, and medium based on sound event detection models and visual detection models, providing an intelligent annotation scheme for dual-modal collaborative verification of audio and video.

[0020] As shown in Figure 1, this invention discloses a multimodal data processing method based on a sound event detection model and a visual detection model, including:

Step S11: Obtain audio data from multimodal data, determine the event occurrence probability of the target event in the audio data through a sound event detection model, and determine the corresponding target probability curve based on the event occurrence probability and temporal smoothing technology.

[0021] In this embodiment, audio data is first extracted from the multimodal data to be processed. This audio data may contain various types of sound events. To accurately identify specific target events, the probability of occurrence of the target event in each frame of the audio data is predicted using a sound event detection model, resulting in an initial prediction result. After obtaining the initial prediction result, further processing is required to improve the accuracy and continuity of event detection. Target prediction results that meet a preset high-confidence condition can be determined from the initial prediction result. Subsequently, in the target prediction result of each frame of the audio data, the probability of occurrence of events belonging to the same preset semantic category is accumulated and regularized to obtain an initial probability curve. To eliminate minor spikes and discontinuities in the initial probability curve that may be caused by noise interference or model prediction fluctuations, making it smoother and more continuous, a one-dimensional morphological closing operation is performed on the initial probability curve using temporal smoothing technology to obtain a target probability curve. This target probability curve provides a reliable temporal probability basis for subsequent fusion with the processing results of the visual detection model.
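The per-frame aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the label map `CATEGORY_OF`, the function name `category_curve`, the `top_k` and `min_conf` parameters, and the reading of "regularized" as clipping the summed probability into [0, 1] are all hypothetical choices made for the example.

```python
# Hypothetical label-to-category map; the source only names the idea of
# grouping labels into a shared semantic category.
CATEGORY_OF = {
    "Aircraft": "Aircraft",
    "Jet engine": "Aircraft",
    "Car": "Vehicle",
    "Engine": "Vehicle",
}


def category_curve(frames, category, top_k=3, min_conf=0.1):
    """Return one probability per frame for `category`.

    frames: list of dicts, each mapping label -> probability for one frame.
    """
    curve = []
    for frame in frames:
        # Keep the top_k most probable labels of this frame.
        top = sorted(frame.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        # Assumed "high confidence condition": a simple probability floor.
        kept = [(label, p) for label, p in top if p >= min_conf]
        # Accumulate the probabilities of labels in the requested category.
        total = sum(p for label, p in kept if CATEGORY_OF.get(label) == category)
        # Assumed regularization: clip the sum so it stays a valid probability.
        curve.append(min(1.0, total))
    return curve
```

Calling `category_curve(frames, "Aircraft")` on a list of per-frame label scores yields the initial probability curve for that category; the smoothing step would then be applied to this list.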

[0022] Step S12: Based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, determine the target time period in the audio data in which the target event lasts, and determine the audio data in the audio data corresponding to the target time period as the target audio data; wherein, the first preset probability threshold is a threshold used to determine whether the target event in the audio data has started; the second preset probability threshold is a threshold used to determine whether the target event in the audio data has ended; the first preset probability threshold is greater than the second preset probability threshold.

[0023] In this embodiment, after obtaining the target probability curve, it is necessary to locate the specific time period of the target event in the audio data based on the curve. To this end, two different probability thresholds are preset, namely a first preset probability threshold and a second preset probability threshold. The first preset probability threshold is used to determine the start of the target event, and its value is relatively high to ensure that the detected start time has a high confidence level. The second preset probability threshold is used to determine the end of the target event, and its value is relatively low so that the termination point can still be captured when the target event gradually weakens, avoiding premature truncation of the event due to an excessively high threshold.

[0024] In the specific implementation process, the probability values in the target probability curve are determined; in the audio data, the moments with probability values greater than a first preset probability threshold are marked as the start time of the target event; in the audio data later than the start time, the moments with probability values less than a second preset probability threshold are marked as the end time of the target event; based on the start time and the end time, a target time period in which the target event lasts in the audio data is determined; wherein, in the target time period, except for the end time, the probability values corresponding to all other moments are greater than the second preset probability threshold to ensure the continuity of the target event. Through the above method, the target time period in which the target event lasts in the audio data can be determined based on the start time and the end time, and then an audio segment corresponding to the target time period can be extracted from the original audio data as the target audio data for subsequent processing.
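The dual-threshold rule above behaves like a hysteresis comparator: a high threshold opens an event and a lower one closes it. The sketch below is one possible reading, assuming the curve is a list of per-frame probabilities; the function name `hysteresis_segments`, the default threshold values, and the choice to close a still-open event at the last frame are assumptions, not details given in the source.

```python
def hysteresis_segments(curve, start_thr=0.6, end_thr=0.3):
    """Return (start_index, end_index) frame pairs, end index inclusive."""
    if start_thr <= end_thr:
        raise ValueError("start_thr must be greater than end_thr")
    segments = []
    start = None  # index of the currently open event, if any
    for i, p in enumerate(curve):
        if start is None:
            # No open event: only a value above the high threshold opens one.
            if p > start_thr:
                start = i
        elif p < end_thr:
            # Open event: it ends at the first value below the low threshold.
            segments.append((start, i))
            start = None
    if start is not None:
        # Assumption: an event still open at the end is closed at the last frame.
        segments.append((start, len(curve) - 1))
    return segments
```

For example, on the curve `[0.1, 0.7, 0.5, 0.4, 0.2, 0.1, 0.65, 0.9]` the values 0.5 and 0.4 do not end the first event because they stay above the lower threshold, so the function returns `[(1, 4), (6, 7)]`. Frame indices can be converted to seconds by multiplying by the model's frame hop.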

[0025] To further improve the accuracy and robustness of subsequent multimodal data fusion, after determining the target audio data, the audio data near the event boundary can be buffered. Specifically, in the audio data, a time earlier than the start time by a preset duration is determined as the first time, and a time later than the end time by the preset duration is determined as the second time. The time intervals between the first time and the start time, and between the end time and the second time, are determined as buffer time intervals, and buffered audio data corresponding to the buffer time intervals are determined in the audio data. Preset audio enhancement or attenuation processing is applied to the audio data corresponding to the preset time intervals in the buffered audio data to obtain and save the corresponding processed audio data. After the above processing, the processed audio data and the target audio data are concatenated and saved in chronological order to obtain a complete audio segment containing the target event and its contextual transition information.
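A minimal sketch of the buffering step follows. The source does not specify the enhancement or attenuation processing, so a linear fade-in before the event and a linear fade-out after it are assumed here purely for illustration; `buffer_intervals`, `linear_fade`, and clamping the buffers to the clip boundaries are likewise hypothetical.

```python
def buffer_intervals(start_s, end_s, pad_s, total_s):
    """Return the two buffer intervals around an event, in seconds.

    The first interval runs from the first moment to the start time, the
    second from the end time to the second moment. Both are clamped to
    the clip range [0, total_s] (an assumption).
    """
    first = max(0.0, start_s - pad_s)
    second = min(total_s, end_s + pad_s)
    return (first, start_s), (end_s, second)


def linear_fade(samples, fade_in=True):
    """Scale samples by a linear ramp: 0 -> 1 for fade-in, 1 -> 0 for fade-out."""
    n = len(samples)
    if n < 2:
        return list(samples)
    out = []
    for i, x in enumerate(samples):
        gain = i / (n - 1)
        if not fade_in:
            gain = 1.0 - gain
        out.append(x * gain)
    return out
```

With these helpers, the samples in the leading buffer would be passed through `linear_fade(..., fade_in=True)`, the trailing buffer through `linear_fade(..., fade_in=False)`, and the three pieces concatenated in chronological order.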

[0026] Step S13: Obtain target video data corresponding to the target audio data from the multimodal data, and determine the probability of the existence of the target object in the target video data through a visual detection model.

[0027] In this embodiment, after determining the target audio data and its corresponding target time period, video data corresponding to that time period needs to be synchronously acquired from the original multimodal data. That is, video frames corresponding to the target time period are extracted from the multimodal data using a preset frequency to obtain the target video data corresponding to the target audio data. After obtaining the target video data, it is input into a pre-trained visual detection model for processing. The visual detection model determines the set of object categories in the target video data and, based on the set of object categories, calculates the probability of the existence of the target objects in the target video data.
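The frame extraction and probability calculation can be sketched as below. The source does not state how the object existence probability is derived from the detected category set, so this example assumes it is the mean, over sampled frames, of the best detection score for the target category. The names `sample_timestamps` and `existence_probability`, the default sampling rate, and the `(category, score)` detection format are hypothetical.

```python
def sample_timestamps(start_s, end_s, fps=2.0):
    """Timestamps in seconds of frames sampled at `fps` within [start_s, end_s]."""
    step = 1.0 / fps
    # Small epsilon guards against floating-point error in the frame count.
    count = int((end_s - start_s) * fps + 1e-9) + 1
    return [start_s + i * step for i in range(count)]


def existence_probability(frame_detections, target):
    """Mean over frames of the highest score detected for `target`.

    frame_detections: one list per sampled frame of (category, score) pairs.
    Frames without the target contribute 0.0 (an assumption).
    """
    if not frame_detections:
        return 0.0
    best = [
        max((score for cat, score in dets if cat == target), default=0.0)
        for dets in frame_detections
    ]
    return sum(best) / len(best)
```

A detector would be run on the frame nearest each timestamp returned by `sample_timestamps`, and its per-frame outputs passed to `existence_probability` together with the category associated with the audio event.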

[0028] Step S14: Perform consistency verification on the target audio data and the target video data based on the probability of the event occurring and the probability of the object existing, and label the target audio data and the target video data based on the obtained verification results, and train the target model using the labeled target audio data and the target video data.

[0029] In this embodiment, after obtaining the event occurrence probability output by the sound event detection model and the object existence probability output by the visual detection model, a consistency check needs to be performed on the target audio data and the target video data based on the event occurrence probability and the object existence probability to determine whether there is a reasonable correspondence between the target events detected in the audio data and the target objects detected in the video data. The result of the consistency check will determine the specific processing method and annotation strategy for the target audio data and target video data. Specifically, depending on the check result, various different implementation methods can be used to process and utilize the data.

[0030] In one specific implementation, if the verification result indicates that the probability of the object's existence is greater than a third preset probability threshold, and there is a preset correlation between the target event in the target audio data and the target object in the target video data, then a preset slicing operation is performed on the multimodal data based on the target audio data and the target video data to obtain first sliced data. Subsequently, the first sliced data is annotated accordingly based on the target event and the target object to obtain first annotated data. The first annotated data is then used to train the audio generation model and video generation model based on the latent diffusion model to improve the model's ability to generate one modality from another.

[0031] In another specific implementation, if the verification result indicates that the target audio data meets a preset high-confidence condition, and there is no preset correlation between the target event in the target audio data and the target object in the target video data, then the preset slicing operation is performed on the multimodal data based on the target audio data and the target video data to obtain second sliced data. To fully utilize this type of data to improve the model's discriminative ability, preset mismatched labels are annotated on the second sliced data to obtain second annotated data, or the sliced data is converted into structured data, and the second annotated data or the structured data is used to train a multimodal pre-trained model based on contrastive learning to enhance the model's ability to distinguish between modal consistency and inconsistency.

[0032] In a third specific implementation, if the verification results show that the target audio data meets the preset data integrity condition and the preset high signal-to-noise ratio condition, and the target video data meets the preset visual blur condition, and the target event in the target audio data and the target object in the target video data do not have the preset correlation relationship, then the target audio data is annotated accordingly based on the target event to obtain annotated audio data. Subsequently, the annotated audio data is saved to a preset audio dataset, and the sound event detection model is fine-tuned and optimized using the preset audio dataset to improve the recognition accuracy and robustness of the sound event detection model in complex acoustic environments.
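The three implementations above amount to routing each clip to a different training use depending on the verification result. The sketch below shows one way that routing might look. The `CheckResult` fields, the route names, the default threshold, and in particular the decision to test the audio-only branch before the mismatch branch (the two can overlap when there is no correlation) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    object_prob: float      # object existence probability from the visual model
    related: bool           # preset correlation between event and object
    audio_high_conf: bool   # audio meets the high confidence condition
    audio_complete: bool    # audio meets the data integrity condition
    high_snr: bool          # audio meets the high signal-to-noise condition
    video_blurred: bool     # video meets the visual blur condition


def route(check, object_thr=0.5):
    """Return the training use assumed for a clip, given its check result."""
    if check.related and check.object_prob > object_thr:
        # Matched pair: slice and label for audio/video generation training.
        return "paired_generation"
    if not check.related:
        if check.audio_complete and check.high_snr and check.video_blurred:
            # Clean audio with unusable video: keep audio for SED fine-tuning.
            return "audio_only_finetune"
        if check.audio_high_conf:
            # Confident audio without a matching object: contrastive negative.
            return "mismatch_contrastive"
    return "unrouted"
```

Clips that fall through to `"unrouted"` are simply not covered by any of the three described cases; how they are handled is not stated in the source.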

[0033] Furthermore, this embodiment also includes the operation of quality screening of target audio data and target video data to construct a specialized training dataset. Specifically, the signal-to-interference ratio (SIR) of the target audio data is determined based on the event occurrence probability and a preset constant value, and audio data with an SIR greater than a preset purity threshold is identified as pure audio segments. Simultaneously, the bounding box information generated by the visual detection model in processing the target video data is determined, and the fill rate of the target object is determined based on the bounding box information; video data with a fill rate greater than a preset fill rate threshold is identified as target video segments. Next, a supervised learning dataset is constructed using linear mixture processing and data augmentation techniques based on the pure audio segments and the target video segments. This supervised learning dataset is used to train the audio separation model, enabling it to effectively separate the audio signal of the target event in complex acoustic environments.
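Two of the quantities above can be illustrated with short helpers. The fill rate is assumed here to be the fraction of the frame area covered by the bounding box, and linear mixing is assumed to mean adding a scaled interference signal to a pure segment at a chosen signal-to-noise ratio; neither formula is given in the source, and the names `fill_rate` and `mix_at_snr` are hypothetical. The SIR computation is not sketched because the source does not state how the event probability and the preset constant are combined.

```python
import math


def fill_rate(box, frame_w, frame_h):
    """Fraction of the frame covered by box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    # Clip the box to the frame before measuring its area.
    w = max(0.0, min(x2, frame_w) - max(x1, 0.0))
    h = max(0.0, min(y2, frame_h) - max(y1, 0.0))
    return (w * h) / (frame_w * frame_h)


def mix_at_snr(clean, noise, snr_db):
    """Return clean + g * noise, with g chosen to give the requested SNR in dB.

    Both inputs are non-empty lists of samples; the mix is truncated to the
    shorter one.
    """
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        return list(clean)
    # SNR = p_clean / (g^2 * p_noise), solved for the noise gain g.
    g = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + g * n for c, n in zip(clean, noise)]
```

In a supervised separation dataset, the mixture returned by `mix_at_snr` would serve as the model input and the original pure segment as the training target.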

[0034] As can be seen from the above, in this application, audio data is extracted from multimodal data, and the occurrence probability of the target event in the audio data is determined using a sound event detection model. A corresponding target probability curve is then determined based on the event occurrence probability and temporal smoothing techniques. According to the target probability curve, a first preset probability threshold and a second preset probability threshold are used to determine the duration of the target event in the audio data, and the portion of the audio data corresponding to the target duration is identified as the target audio data. Target video data corresponding to the target audio data is extracted from the multimodal data, and the object existence probability of the target object in the target video data is determined using a visual detection model. Consistency verification is performed on the target audio data and target video data by combining the event occurrence probability and the object existence probability. The target audio data and target video data are labeled based on the verification results, and the labeled target audio data and target video data are used to train the target model. In this way, this application can effectively segment only when auditory events and visual objects coexist in both time and semantics, greatly improving the purity and alignment of the training data.

[0035] The technical solutions of the embodiments of this application will be described in detail below.

[0036] Specifically, the SED (Sound Event Detection) model is first used to perform frame-by-frame inference on the input audio stream, calculating the event occurrence probability and performing anti-interference processing. First, tag aggregation and Top-K processing are performed: based on a preset tag list, tags belonging to the same semantic category are grouped together, for example merging "Aircraft" and "Jet engine" into "Aircraft". Within the Top-K results of each frame, the probabilities of the grouped tags are accumulated and regularized to obtain the comprehensive probability curve of that category. Second, morphological temporal smoothing is performed to eliminate high-frequency jitter in the probability curve and to prevent event interruptions caused by brief signal blockages. A one-dimensional morphological closing operation (Closing), that is, a dilation followed by an erosion, is applied to the probability curve. This step automatically reconnects a single physical event that has been interrupted by millisecond-level dips in confidence, ensuring semantic continuity.
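As an illustrative, non-limiting sketch of the tag aggregation and closing steps described above, the following Python fragment may be considered. The tag table `TAG_GROUPS`, the function names, the window radius, and the choice of clipping the accumulated probability to 1 as the "regularization" are all assumptions made for illustration; the application does not fix these details.

```python
from typing import Dict, List

# Hypothetical tag table: raw SED labels mapped to one semantic category.
TAG_GROUPS: Dict[str, str] = {"Aircraft": "Aircraft", "Jet engine": "Aircraft"}


def aggregate_frame(topk: Dict[str, float], category: str) -> float:
    """Accumulate the Top-K probabilities of all labels mapped to `category`.

    Clipping to 1.0 stands in for the regularization step (an assumption).
    """
    total = sum(p for label, p in topk.items() if TAG_GROUPS.get(label) == category)
    return min(total, 1.0)


def dilate(curve: List[float], radius: int) -> List[float]:
    """Grayscale dilation: sliding maximum over a window of 2 * radius + 1 frames."""
    n = len(curve)
    return [max(curve[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def erode(curve: List[float], radius: int) -> List[float]:
    """Grayscale erosion: sliding minimum over the same window."""
    n = len(curve)
    return [min(curve[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def closing(curve: List[float], radius: int) -> List[float]:
    """One-dimensional morphological closing: dilation followed by erosion."""
    return erode(dilate(curve, radius), radius)
```

With a radius of one frame, a single-frame dip between two high-probability frames is filled, while a dip wider than the window is left unchanged, which is the behaviour the smoothing step relies on.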

[0037] To address the over-segmentation caused by a single threshold in existing technologies, this solution sets a trigger threshold and a retention threshold, with the trigger threshold greater than the retention threshold. Based on these dual thresholds, the event determination process is divided into event triggering, event maintenance, and event termination (Offset). Event triggering: the start of the event is marked if and only if the probability value rises above the trigger threshold, which ensures high confidence in the event core. Event maintenance: once triggered, the event is considered ongoing as long as the probability value stays above the retention threshold. This allows the system to fully capture the "tail end" of the sound, such as vehicles moving away or instrument reverberation, avoiding premature cutoff. Event termination: the end of the event is marked when the probability value falls below the retention threshold. Finally, the duration of each segment is calculated, and only segments whose duration meets the preset requirement are retained as audio candidate events.
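The dual-threshold determination above can be sketched, by way of example only, as a small state machine. The names `t_high`, `t_low`, and `min_frames`, and the use of a minimum frame count as the duration requirement, are assumptions introduced for illustration.

```python
from typing import List, Optional, Tuple


def hysteresis_segments(curve: List[float], t_high: float, t_low: float,
                        min_frames: int = 1) -> List[Tuple[int, int]]:
    """Return (onset, offset) frame index pairs; the offset index is exclusive.

    An event starts when the probability rises above t_high, continues while
    it stays at or above t_low, and ends at the first frame below t_low.
    """
    if t_high <= t_low:
        raise ValueError("trigger threshold must exceed retention threshold")
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, p in enumerate(curve):
        if start is None:
            if p > t_high:          # event triggering
                start = i
        elif p < t_low:             # event termination
            if i - start >= min_frames:
                segments.append((start, i))
            start = None
    # An event still active at the end of the curve is closed at the last frame.
    if start is not None and len(curve) - start >= min_frames:
        segments.append((start, len(curve)))
    return segments
```

For example, on the curve `[0.1, 0.8, 0.5, 0.4, 0.6, 0.2, 0.1, 0.9, 0.1]` with `t_high=0.7` and `t_low=0.3`, the first event spans frames 1 to 4 as a single segment even though the probability dips to 0.4 in between; a single threshold at 0.55 would have split it.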

[0038] For each generated audio candidate event time window [T_start, T_end], the system initiates visual verification. First, dynamic frame extraction is performed: video frames are extracted at a fixed frequency of 5 fps within the corresponding time period, avoiding the computational cost of full frame-by-frame detection. Second, object detection is performed: the extracted frames are input into the YOLO model (an object detection model), which outputs the set of object categories C_visual within the current window. Finally, visual presence determination is performed: the proportion of frames in which a specific object (such as an airplane, car, or person) is detected within the time window is computed to confirm the presence of the visual subject.
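The frame sampling and presence statistics above might be sketched as follows. The detector itself is not called here: each sampled frame is assumed to have already been reduced to a set of category names by an object detection model, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def sample_times(t_start: float, t_end: float, fps: float = 5.0) -> List[float]:
    """Timestamps in seconds at a fixed sampling rate inside [t_start, t_end)."""
    count = int((t_end - t_start) * fps)
    return [t_start + k / fps for k in range(count)]


def presence_ratio(frame_labels: Iterable[Set[str]], target: str) -> float:
    """Share of sampled frames whose detected category set contains `target`."""
    frames = list(frame_labels)
    if not frames:
        return 0.0
    hits = sum(1 for labels in frames if target in labels)
    return hits / len(frames)
```

A two-second window sampled at 5 fps yields ten timestamps, so only ten detector calls are needed instead of one per native video frame.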

[0039] The start and end times of the final slice are then determined. To avoid the waveform truncation pops and spectral leakage caused by direct physical cutting, the system performs fade-in and fade-out processing before saving the audio file. First, padding is performed: provided the original file boundaries are not exceeded, a buffer of 100 ms to 500 ms is added before T_start and after T_end to capture the natural onset and reverberation of the sound. Second, an envelope is applied: for fade-in, a linear or logarithmic gradually increasing gain curve is applied to the first 10 ms to 50 ms of the slice; for fade-out, a linear or logarithmic gradually decreasing gain curve is applied to the last 10 ms to 50 ms of the slice. Finally, the processed PCM (an audio format) data is encoded and saved as the final training sample file. This step ensures that the energy of the generated audio clip smoothly returns to zero at both ends, eliminating high-frequency noise interference caused by zero-point drift.
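A minimal sketch of the padding and envelope steps, assuming a linear gain curve and audio samples held as a list of floats, is given below. The function names and the default values of 200 ms of padding and 20 ms of fade are illustrative choices within the ranges stated above.

```python
from typing import List, Tuple


def padded_bounds(t_start: float, t_end: float, duration: float,
                  pad: float = 0.2) -> Tuple[float, float]:
    """Extend the slice by `pad` seconds on each side, clamped to the file."""
    return max(0.0, t_start - pad), min(duration, t_end + pad)


def apply_linear_fades(samples: List[float], sample_rate: int,
                       fade_s: float = 0.02) -> List[float]:
    """Apply a linear fade-in and fade-out of `fade_s` seconds to a copy."""
    n = len(samples)
    # Never let the two fades overlap on very short slices.
    fade_len = min(int(sample_rate * fade_s), n // 2)
    out = list(samples)
    for i in range(fade_len):
        gain = i / fade_len          # 0.0 at the edge, rising towards 1.0
        out[i] *= gain               # fade-in
        out[n - 1 - i] *= gain       # fade-out, mirrored
    return out
```

The first and last samples are scaled to exactly zero, so the saved clip starts and ends without a discontinuity, while samples outside the fade regions are untouched.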

[0040] Based on the audio-visual consistency verification results, instead of simply performing a binary "keep / discard" operation, a three-way splitting strategy of "positive sample slicing - negative sample mining - pure audio recycling" is implemented. Branch one, multimodal positive samples (Positive Pairs), is used to train generative models. Decision logic: the audio event is triggered under the dual thresholds, and the corresponding object is detected in the video. Processing action: precise slicing is performed and the result is saved as (Video, Audio, Label) triples. These triples can be used to train the positive alignment capabilities of text-to-audio and text-to-video generation models such as AudioLDM and Sora.

[0041] Branch two, hard negative samples (Hard Negative Samples), is used to train contrastive learning models. Decision logic: a high-confidence event is detected in the audio, but the corresponding object is not detected at all in the video within the corresponding time window; in addition, the video is not black or otherwise invalid, so the visual detection result itself can be trusted. Processing action: the segment is sliced and either labeled with a special mismatch tag or saved as structured data. This type of data is extremely valuable when training models such as CLAP (Contrastive Language-Audio Pretraining). It teaches the model that "hearing a car sound but not seeing a car does not constitute a match," thus significantly reducing the hallucination rate during inference.

[0042] Branch 3 involves the recycling of pure audio resources to feed back into the SED model. The criteria are: high audio quality (high signal-to-noise ratio, complete event), but blurry, occluded, or completely irrelevant visuals (e.g., lens cap not open). The processing steps are: decoupling and extraction: stripping the video track and extracting only the audio stream. Reclassification: saving it as a pure audio dataset in .wav format and automatically labeling it based on the SED results. While this data cannot be used for multimodal training, it can be fed back into AudioSet (an audio event dataset) to expand the pure audio dataset, further fine-tuning and optimizing the SED model in this system, achieving the system's "self-evolution."
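The three branches above can be summarized, purely as an illustrative sketch, by a routing function. The field names, the `presence_threshold` default, the simplification of branch three to "video not usable", and the `DISCARD` fallback for candidates that match none of the three branches are assumptions made for this example and are not specified by the application.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    POSITIVE_PAIR = "positive_pair"   # branch one: (Video, Audio, Label) triple
    HARD_NEGATIVE = "hard_negative"   # branch two: mismatch-labeled slice
    AUDIO_ONLY = "audio_only"         # branch three: recycled pure audio
    DISCARD = "discard"               # assumed fallback for ambiguous cases


@dataclass
class Candidate:
    audio_confident: bool   # event triggered under the dual thresholds
    audio_clean: bool       # complete event with high signal-to-noise ratio
    video_valid: bool       # video is not black, blurred, or occluded
    presence_ratio: float   # share of sampled frames containing the object


def route(c: Candidate, presence_threshold: float = 0.5) -> Route:
    """Assign one audio candidate event to a processing branch."""
    if c.audio_confident and c.video_valid and c.presence_ratio > presence_threshold:
        return Route.POSITIVE_PAIR
    if c.audio_confident and c.video_valid and c.presence_ratio == 0.0:
        return Route.HARD_NEGATIVE
    if c.audio_clean and not c.video_valid:
        return Route.AUDIO_ONLY
    return Route.DISCARD
```

Keeping the decision in one function makes the branch conditions mutually exclusive by construction, since the first matching rule wins.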

[0043] To address the high requirements for single-source purity when training audio separation models, this system designs an "exclusive purification" mode. Audio exclusivity constraint: when calculating frame-level probabilities, a proxy index for the signal-to-interference ratio (SIR) is introduced. The index is calculated as follows: $SIR_{proxy} = \frac{P_{target}}{\sum_{i \neq target} P_i + \epsilon}$. In the formula, $P_{target}$ is the probability of the target event occurring in the frame, $\sum_{i \neq target} P_i$ is the sum of the occurrence probabilities of all events in the frame other than the target event, and $\epsilon$ is a very small positive number.
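By way of example, and assuming that the index is the ratio of the target probability to the sum of all other event probabilities plus a small constant, as the variable descriptions above suggest, the frame-level computation might look as follows. The function name and the default value of `eps` are hypothetical.

```python
from typing import Dict


def sir_proxy(frame_probs: Dict[str, float], target: str, eps: float = 1e-6) -> float:
    """Proxy signal-to-interference ratio for one frame.

    Assumed form: P_target / (sum of all other event probabilities + eps).
    """
    p_target = frame_probs.get(target, 0.0)
    p_others = sum(p for label, p in frame_probs.items() if label != target)
    return p_target / (p_others + eps)
```

The small constant keeps the ratio finite when no competing event is detected, in which case the index becomes very large and the frame is treated as highly pure.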

[0044] In this embodiment, samples are marked as pure only when SIR_proxy is greater than a preset purity threshold. This means the system discards segments that "detect the target but also detect human voices or other high-frequency interference." Visual dominance constraint: Introducing visual saliency filtering. Using the bounding box information output by YOLO, the fill rate of the target object is calculated. The system only retains segments with a fill rate greater than a preset value such as 40% and no other interfering objects in the image. This strategy is based on the "near-field effect" principle, ensuring that the extracted audio is a dominant sound source with high loudness and high signal-to-noise ratio, thereby obtaining near-studio-quality single-track material. Construction of synthetic training data: Using the pure events output by this system and other pure backgrounds, a supervised learning dataset containing "mixture," "source," and "noise" is automatically constructed through random linear mixing and other enhancement techniques, directly empowering the training of the separation model.
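The fill rate filter and the construction of synthetic training data described above can be illustrated with the following sketch. The box layout `(x_min, y_min, x_max, y_max)` in pixels, the function names, and the single `noise_gain` parameter are assumptions; an actual system may use other augmentation techniques in addition to linear mixing.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def fill_rate(box: Box, frame_w: int, frame_h: int) -> float:
    """Fraction of the frame area covered by the bounding box, clipped to the frame."""
    x0, y0, x1, y1 = box
    w = max(0.0, min(x1, frame_w) - max(x0, 0.0))
    h = max(0.0, min(y1, frame_h) - max(y0, 0.0))
    return (w * h) / (frame_w * frame_h)


def make_training_triple(source: List[float], noise: List[float],
                         noise_gain: float) -> Dict[str, List[float]]:
    """Linearly mix a pure event with a background into a supervised sample."""
    n = min(len(source), len(noise))
    scaled_noise = [noise_gain * noise[i] for i in range(n)]
    mixture = [source[i] + scaled_noise[i] for i in range(n)]
    return {"mixture": mixture, "source": source[:n], "noise": scaled_noise}
```

A segment would be kept only when `fill_rate` exceeds the preset value (for example 0.4), and varying `noise_gain` at random across samples is one simple way to obtain mixtures at different interference levels.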

[0045] Therefore, this application introduces a dual-threshold hysteresis mechanism, applying the Schmitt trigger principle of "high threshold establishes existence, low threshold maintains continuity." It captures the entire course of a sound, from fade-in through climax to fade-out, without interruption due to fluctuations or accidental triggering by noise at the beginning and end, balancing completeness and accuracy. The solution generalizes across all AudioSet categories rather than applying only to vehicles: its "morphological smoothing" strategy is also applicable to non-rigid, discontinuous acoustic events, such as intermittent thunder and pauses in dialogue, demonstrating strong versatility. Using YOLO visual semantics as a "hard constraint," the semantic misalignment caused by off-screen audio is eliminated, producing high-quality multimodal aligned data without manual intervention. Existing technologies typically treat "audio-visual inconsistency," such as off-screen audio, as noise data and discard it directly, resulting in low data utilization and wasting valuable corrective material. This solution instead defines "sound without image" segments as "hard negative samples" and automatically archives them separately. Such samples are crucial for training multimodal contrastive learning models. They force the model to learn to distinguish between "true alignment" and "coincidental coexistence" through an "explicit penalty" mechanism, thereby significantly improving the multimodal model's robustness to interference and semantic alignment accuracy in complex scenarios.

[0046] Accordingly, as shown in Figure 2, this application provides a multimodal data processing device based on a sound event detection model and a visual detection model, including: a curve determination module 11, used to acquire audio data from multimodal data, determine the event occurrence probability of the target event in the audio data through a sound event detection model, and determine the corresponding target probability curve based on the event occurrence probability and a temporal smoothing technique; a data determination module 12, used to determine the target time period in the audio data based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, and to determine the audio data corresponding to the target time period in the audio data as target audio data, wherein the first preset probability threshold is a threshold used to determine whether the target event in the audio data has started, the second preset probability threshold is a threshold used to determine whether the target event in the audio data has ended, and the first preset probability threshold is greater than the second preset probability threshold; a probability determination module 13, used to obtain target video data corresponding to the target audio data from the multimodal data, and to determine the object existence probability of the target object in the target video data through a visual detection model; and a data processing module 14, used to perform consistency verification on the target audio data and the target video data based on the event occurrence probability and the object existence probability, to annotate the target audio data and the target video data based on the obtained verification results, and to train the target model using the annotated target audio data and target video data.

[0047] In some specific embodiments, the curve determination module 11 specifically includes: The probability prediction unit is used to predict the probability of occurrence of target events in each frame of the audio data through a sound event detection model, and obtain an initial prediction result. The result determination unit is used to determine the target prediction result whose confidence level meets the preset high confidence condition from the initial prediction result; The probability processing unit is used to perform probability accumulation and regularization operations on the probability of events belonging to the same preset semantic category in the target prediction result of each frame of the audio data to obtain an initial probability curve. The curve determination unit is used to perform a one-dimensional morphological closing operation on the initial probability curve using time-domain smoothing techniques to obtain the target probability curve.

[0048] In some specific embodiments, the data determination module 12 specifically includes: a probability determination unit, used to determine the probability values in the target probability curve; a first time marking unit, used to mark the moment in the audio data at which the probability value is greater than the first preset probability threshold as the start time of the target event; a second time marking unit, used to mark the moment in the audio data, later than the start time, at which the probability value is less than the second preset probability threshold as the end time of the target event; and a time period determination unit, used to determine the target time period in the audio data based on the start time and the end time, wherein, within the target time period, the probability value corresponding to each moment other than the end time is greater than the second preset probability threshold.

[0049] In some specific embodiments, the data determination module 12 further includes: A time determination unit is used to determine, in the audio data, a time earlier than the start time by a preset duration as a first time, and a time later than the end time by the preset duration as a second time; The first data determination unit is used to determine the time period between the first time and the start time, and the time period between the end time and the second time as a buffer time period, and to determine the buffer audio data corresponding to the buffer time period in the audio data. The data processing unit is used to perform preset audio enhancement processing or preset audio attenuation processing on the audio data in the buffered audio data corresponding to the preset time period, and to obtain and save the corresponding processed audio data.

[0050] In some specific embodiments, the probability determination module 13 specifically includes: The second data determination unit is used to extract video frames corresponding to the target time period from the multimodal data using a preset frequency, so as to obtain target video data corresponding to the target audio data. The probability statistics unit is used to determine the set of object categories in the target video data through a visual detection model, and to calculate the probability of the existence of the target object in the target video data based on the set of object categories.

[0051] In some specific embodiments, the data processing module 14 specifically includes: a first data slicing unit, used to perform a preset slicing operation on the multimodal data based on the target audio data and the target video data to obtain first sliced data, if the verification result shows that the object existence probability is greater than a third preset probability threshold and there is a preset correlation between the target event in the target audio data and the target object in the target video data; a first model training unit, used to annotate the first sliced data according to the target event and the target object to obtain first annotated data, and to train an audio generation model and a video generation model based on the latent diffusion model using the first annotated data; a second data slicing unit, used to perform the preset slicing operation on the multimodal data based on the target audio data and the target video data to obtain second sliced data, if the verification result shows that the target audio data meets the preset high-confidence condition and there is no preset correlation between the target event in the target audio data and the target object in the target video data; a second model training unit, used to annotate the second sliced data with preset mismatch labels to obtain second annotated data, or to convert the second sliced data into structured data, and to train a multimodal pre-trained model based on contrastive learning using the second annotated data or the structured data; a data annotation unit, used to annotate the target audio data based on the target event to obtain annotated audio data, if the verification result shows that the target audio data meets the preset data integrity condition and the preset high signal-to-noise ratio condition, the target video data meets the preset visual blur condition, and the target event in the target audio data and the target object in the target video data do not have the preset correlation; and a model adjustment unit, used to save the annotated audio data to a preset audio dataset and to fine-tune and optimize the sound event detection model using the preset audio dataset.

[0052] In some specific embodiments, the multimodal data processing device based on the sound event detection model and the visual detection model further includes: a first segment determination unit, used to determine the signal-to-interference ratio (SIR) of the target audio data based on the event occurrence probability and a preset constant value, and to determine audio data in the target audio data whose SIR is greater than a preset purity threshold as pure audio segments; a fill rate determination unit, used to determine the bounding box information generated by the visual detection model in processing the target video data, and to determine the fill rate of the target object based on the bounding box information; a second segment determination unit, used to determine video data in the target video data whose fill rate is greater than a preset fill rate threshold as target video segments; and a third model training unit, used to construct a supervised learning dataset based on the pure audio segments and the target video segments using linear mixing and data augmentation techniques, so as to train the audio separation model using the supervised learning dataset.

[0053] Furthermore, embodiments of this application also disclose an electronic device. Figure 3 is a structural diagram of an electronic device 20 according to an exemplary embodiment. The content of the diagram should not be construed as limiting the scope of this application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input / output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the multimodal data processing method based on the sound event detection model and the visual detection model disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be an electronic computer.

[0054] In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 25 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.

[0055] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk, or optical disk, etc. The resources stored thereon can include an operating system 221, computer programs 222, etc., and the storage method can be temporary storage or permanent storage.

[0056] The operating system 221 is used to manage and control the various hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, Netware, Unix, Linux, etc. In addition to the computer program that, when executed by the electronic device 20, performs the multimodal data processing method based on a sound event detection model and a visual detection model disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs capable of performing other specific tasks.

[0057] Furthermore, this application also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned multimodal data processing method based on a sound event detection model and a visual detection model. Specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.

[0058] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.

[0059] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0060] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0061] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0062] The technical solutions provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A multimodal data processing method based on a sound event detection model and a visual detection model, characterized in that it comprises: obtaining audio data from multimodal data, determining the event occurrence probability of a target event in the audio data through a sound event detection model, and determining a corresponding target probability curve based on the event occurrence probability and a temporal smoothing technique; determining, based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, the target time period in the audio data, and determining the audio data corresponding to the target time period as the target audio data, wherein the first preset probability threshold is a threshold used to determine whether the target event in the audio data has started, the second preset probability threshold is a threshold used to determine whether the target event in the audio data has ended, and the first preset probability threshold is greater than the second preset probability threshold; obtaining target video data corresponding to the target audio data from the multimodal data, and determining the object existence probability of the target object in the target video data through a visual detection model; and performing consistency verification on the target audio data and the target video data based on the event occurrence probability and the object existence probability, labeling the target audio data and the target video data based on the obtained verification results, and training the target model using the labeled target audio data and target video data.

2. The multimodal data processing method based on a sound event detection model and a visual detection model according to claim 1, characterized in that, The step of determining the occurrence probability of a target event in the audio data using a sound event detection model, and determining the corresponding target probability curve based on the event occurrence probability and temporal smoothing techniques, includes: The probability of occurrence of target events in each frame of the audio data is predicted by a sound event detection model to obtain an initial prediction result; From the initial prediction results, determine the target prediction result whose confidence level meets the preset high confidence condition; In the target prediction results of each frame of the audio data, the probability of events belonging to the same preset semantic category is accumulated and regularized to obtain an initial probability curve; The initial probability curve is subjected to a one-dimensional morphological closing operation using temporal smoothing techniques to obtain the target probability curve.

3. The multimodal data processing method based on a sound event detection model and a visual detection model according to claim 1, characterized in that, The step of determining the target time period of the target event in the audio data based on the target probability curve and using a first preset probability threshold and a second preset probability threshold includes: Determine the probability value in the target probability curve; In the audio data, the moments with a probability value greater than a first preset probability threshold are marked as the start time of the target event; In the audio data that is later than the start time, the time when the probability value is less than the second preset probability threshold is marked as the end time of the target event; The target time period in the audio data is determined based on the start time and the end time; wherein, during the target time period, the probability value corresponding to each time point other than the end time is greater than the second preset probability threshold.

4. The multimodal data processing method based on a sound event detection model and a visual detection model according to claim 3, characterized in that, After determining the audio data corresponding to the target time period in the audio data as the target audio data, the method further includes: In the audio data, the moment earlier than the start time by a preset duration is determined as the first moment, and the moment later than the end time by the preset duration is determined as the second moment; The time interval between the first time and the start time, and the time interval between the end time and the second time are defined as buffer time intervals, and buffered audio data corresponding to the buffer time intervals are determined in the audio data; The audio data corresponding to the preset time period in the buffered audio data is subjected to preset audio enhancement processing or preset audio attenuation processing to obtain and save the corresponding processed audio data.

5. The multimodal data processing method based on a sound event detection model and a visual detection model according to claim 1, characterized in that, The step of obtaining target video data corresponding to the target audio data from the multimodal data, and determining the probability of the existence of the target object in the target video data through a visual detection model, includes: Video frames corresponding to the target time period are extracted from the multimodal data using a preset frequency to obtain target video data corresponding to the target audio data; The set of object categories in the target video data is determined by a visual detection model, and the probability of the existence of the target object in the target video data is calculated based on the set of object categories.

6. The multimodal data processing method based on a sound event detection model and a visual detection model according to any one of claims 1 to 5, characterized in that the step of labeling the target audio data and the target video data based on the obtained verification result, and training the target model using the labeled target audio data and target video data, comprises: if the verification result indicates that the object existence probability is greater than a third preset probability threshold, and a preset association relationship exists between the target event in the target audio data and the target object in the target video data, performing a preset slicing operation on the multimodal data based on the target audio data and the target video data to obtain first sliced data, labeling the first sliced data based on the target event and the target object to obtain first labeled data, and using the first labeled data to train an audio generation model and a video generation model based on a latent diffusion model; if the verification result indicates that the target audio data meets a preset high-confidence condition, and the preset association relationship does not exist between the target event in the target audio data and the target object in the target video data, performing the preset slicing operation on the multimodal data based on the target audio data and the target video data to obtain second sliced data, labeling the second sliced data with a preset mismatch label to obtain second labeled data, or converting the second sliced data into structured data, and using the second labeled data or the structured data to train a multimodal pre-trained model based on contrastive learning; and if the verification result indicates that the target audio data meets a preset data integrity condition and a preset high signal-to-noise ratio condition, the target video data meets a preset visual blur condition, and the preset association relationship does not exist between the target event in the target audio data and the target object in the target video data, labeling the target audio data based on the target event to obtain labeled audio data, saving the labeled audio data to a preset audio dataset, and using the preset audio dataset to fine-tune and optimize the sound event detection model.
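The three branches of claim 6 amount to a routing decision over the verification result. The sketch below is one possible reading, not the claimed implementation. The second and third branches can both hold for the same sample, and the claim does not say which takes priority, so checking the audio-only branch first is an assumption. The `DISCARD` outcome for samples matching no branch is also an assumption, and all field and enum names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    MATCHED_PAIR = "matched_pair"        # first labeled data -> generation models
    MISMATCHED_PAIR = "mismatched_pair"  # second labeled data -> contrastive pre-training
    AUDIO_ONLY = "audio_only"            # labeled audio -> fine-tune the SED model
    DISCARD = "discard"                  # ASSUMPTION: not covered by the claim


@dataclass
class CheckResult:
    object_prob: float            # object existence probability
    associated: bool              # preset event/object association holds
    audio_high_confidence: bool   # preset high-confidence condition
    audio_complete: bool          # preset data integrity condition
    audio_high_snr: bool          # preset high signal-to-noise ratio condition
    video_blurred: bool           # preset visual blur condition


def route(result: CheckResult, third_threshold: float) -> Route:
    """Map a consistency-verification result to a labeling route."""
    if result.associated:
        if result.object_prob > third_threshold:
            return Route.MATCHED_PAIR
        return Route.DISCARD
    # No association from here on. ASSUMPTION: the audio-only branch is
    # checked before the mismatch branch when both conditions hold.
    if result.audio_complete and result.audio_high_snr and result.video_blurred:
        return Route.AUDIO_ONLY
    if result.audio_high_confidence:
        return Route.MISMATCHED_PAIR
    return Route.DISCARD
```

Keeping the decision in a single pure function makes the branch priority explicit and easy to change if a different ordering is intended.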

7. The multimodal data processing method based on a sound event detection model and a visual detection model according to claim 1, characterized in that the method further comprises: determining a signal-to-interference ratio (SIIR) of the target audio data based on the event occurrence probability and a preset constant value, and determining audio data in the target audio data whose SIIR is greater than a preset purity threshold as pure audio segments; determining bounding box information generated by the visual detection model when processing the target video data, and determining a fill rate of the target object based on the bounding box information; determining video data in the target video data whose fill rate is greater than a preset fill rate threshold as target video segments; and constructing a supervised learning dataset based on the pure audio segments and the target video segments using linear mixing and data augmentation techniques, so as to train an audio separation model using the supervised learning dataset.
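Two of the quantities in claim 7 can be sketched under stated assumptions. The fill rate is taken here to be the bounding box area divided by the frame area, and linear mixing is taken to be a sample-wise weighted sum of a pure segment and an interfering signal. Neither definition is spelled out in the claim, so both are illustrative assumptions. The SIIR formula is not given in the claim and is deliberately not reproduced. Function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def fill_rate(box: Box, frame_w: float, frame_h: float) -> float:
    """ASSUMPTION: fraction of the frame area covered by the bounding box,
    after clipping the box to the frame boundaries."""
    if frame_w <= 0 or frame_h <= 0:
        raise ValueError("frame dimensions must be positive")
    x1, y1, x2, y2 = box
    width = max(0.0, min(x2, frame_w) - max(x1, 0.0))
    height = max(0.0, min(y2, frame_h) - max(y1, 0.0))
    return (width * height) / (frame_w * frame_h)


def linear_mix(pure: List[float], interference: List[float], gain: float) -> List[float]:
    """ASSUMPTION: mixture[i] = pure[i] + gain * interference[i], truncated
    to the shorter input. The pure segment serves as the supervision target
    when training a separation model on the mixture."""
    n = min(len(pure), len(interference))
    return [pure[i] + gain * interference[i] for i in range(n)]


def select_target_segments(
    boxes: List[Box], frame_w: float, frame_h: float, threshold: float
) -> List[int]:
    """Indices of frames whose fill rate exceeds the preset threshold."""
    return [
        i for i, box in enumerate(boxes)
        if fill_rate(box, frame_w, frame_h) > threshold
    ]
```

A box covering the left half of a 100 x 100 frame has a fill rate of 0.5 under this definition, so it passes a threshold of 0.3 but not one of 0.6.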

8. A multimodal data processing device based on a sound event detection model and a visual detection model, characterized by comprising: a curve determination module, configured to acquire audio data from multimodal data, determine the event occurrence probability of a target event in the audio data through a sound event detection model, and determine a corresponding target probability curve based on the event occurrence probability and a temporal smoothing technique; a data determination module, configured to determine a target time period in the audio data based on the target probability curve and using a first preset probability threshold and a second preset probability threshold, and to determine the audio data corresponding to the target time period in the audio data as target audio data, wherein the first preset probability threshold is used to determine whether the target event in the audio data has started, the second preset probability threshold is used to determine whether the target event in the audio data has ended, and the first preset probability threshold is greater than the second preset probability threshold; a probability determination module, configured to obtain target video data corresponding to the target audio data from the multimodal data, and determine the object existence probability of a target object in the target video data through a visual detection model; and a data processing module, configured to perform consistency verification on the target audio data and the target video data based on the event occurrence probability and the object existence probability, to label the target audio data and the target video data based on the obtained verification result, and to train a target model using the labeled target audio data and target video data.
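The curve determination and data determination modules of claim 8 can be sketched as smoothing followed by dual-threshold hysteresis. The claim names only "a temporal smoothing technique", so the centered moving average below is an assumed stand-in. Whether the comparisons are strict, and the convention that an event still open at the end of the curve closes at the last frame, are also assumptions. Function names are hypothetical.

```python
from typing import List, Tuple


def moving_average(probs: List[float], window: int) -> List[float]:
    """ASSUMPTION: centered moving average as the temporal smoothing step;
    the window shrinks at the edges of the sequence."""
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed = []
    for i in range(len(probs)):
        lo = max(0, i - half)
        hi = min(len(probs), i + half + 1)
        smoothed.append(sum(probs[lo:hi]) / (hi - lo))
    return smoothed


def hysteresis_segments(
    curve: List[float], start_thr: float, end_thr: float
) -> List[Tuple[int, int]]:
    """Half-open frame ranges [start, end) of detected events.

    An event starts when the curve rises above start_thr (the first preset
    probability threshold) and ends when it falls below end_thr (the second,
    lower threshold)."""
    if start_thr <= end_thr:
        raise ValueError("start_thr must be greater than end_thr")
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(curve):
        if start is None and p > start_thr:
            start = i
        elif start is not None and p < end_thr:
            segments.append((start, i))
            start = None
    if start is not None:
        # ASSUMPTION: an event still active at the end is closed there.
        segments.append((start, len(curve)))
    return segments
```

Because the start threshold is higher than the end threshold, a curve hovering between the two values neither opens a new event nor closes an active one, which suppresses rapid on/off flicker around a single cut-off.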

9. An electronic device, characterized by comprising: a memory, configured to store a computer program; and a processor, configured to execute the computer program to implement the multimodal data processing method based on a sound event detection model and a visual detection model as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein the computer program, when executed by a processor, implements the multimodal data processing method based on a sound event detection model and a visual detection model as described in any one of claims 1 to 7.