A method and system for training an active speaker detection model in a noisy environment

An end-to-end multi-task optimization framework with dynamic visual cue aggregation addresses the false positives and false negatives of audio-visual active speaker detection in noisy environments, improving the model's discrimination accuracy under noise and its inference efficiency in multi-face scenarios.

CN122245324A · Pending · Publication Date: 2026-06-19 · SHANDONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANDONG UNIV
Filing Date: 2026-03-13
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing audio-visual active speaker detection methods degrade markedly in noisy environments, producing both false positives and false negatives. In scenes with multiple faces they also suffer from redundant computation and are difficult to optimize.

Method used

The method constructs an end-to-end multi-task joint optimization framework that combines dynamic visual cue aggregation with a multi-dimensional decoupled audio-visual separation guidance network. The framework learns robust audio representations, denoises and focuses the visual priors, and reduces inference overhead.
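
The joint optimization can be illustrated with a minimal sketch of a weighted two-task objective. The abstract states only that separation and detection losses are combined by weighting; the specific loss functions here (mean squared error for separation, binary cross-entropy for detection), the names `joint_loss`, `w_sep`, and `w_asd`, and the default weights are all assumptions made for illustration, not details taken from the patent.

```python
import math

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one speaking / not-speaking prediction."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def mse(estimate, target):
    """Mean squared error, used here as a stand-in separation loss (assumption)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def joint_loss(enhanced, clean, asd_probs, asd_labels, w_sep=0.5, w_asd=1.0):
    """Weighted sum of a separation loss and a detection loss.

    w_sep and w_asd are hypothetical weights; the patent text does not
    give their values.
    """
    sep = mse(enhanced, clean)
    det = sum(bce(p, y) for p, y in zip(asd_probs, asd_labels)) / len(asd_labels)
    return w_sep * sep + w_asd * det, sep, det

# Example: a perfect enhancement but an uninformative detector (p = 0.5).
total, sep, det = joint_loss([0.0, 1.0], [0.0, 1.0], [0.5], [1])
```

In a real training loop both terms would be backpropagated through the shared audio encoder, so the weights control how strongly the separation task shapes the representation used for detection.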

Benefits of technology

The method improves the model's discrimination accuracy and generalization ability in noisy environments, reduces inference complexity in multi-face scenarios, and improves both the accuracy of visual guidance and the stability of training.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122245324A_ABST

Abstract

This invention proposes a training method and system for an active speaker detection (ASD) model in noisy environments, belonging to the interdisciplinary field of speech signal processing and computer vision. Within an end-to-end multi-task framework, the invention uses audio-visual speech enhancement and separation as guiding tasks and optimizes them jointly with the ASD task through a shared audio encoder. During training, lip-sync energy and quality confidence are first calculated, and a scene-level visual prior is obtained through cross-subject arbitration. The aligned visual prior is then fused with the complex spectral features of the mixed speech and fed into a multi-dimensional decoupled separation guidance network, which outputs enhanced speech and a robust audio representation. Finally, the parameters are updated by backpropagating a weighted combination of the separation and detection losses. The invention thus learns, end to end, robust audio representations that are both "clean" and "effective for ASD discrimination" in noisy environments, and it reduces inference overhead in multi-face scenes while denoising and focusing the visual priors.
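
The aggregation step can be sketched as follows. The abstract does not define how lip-sync energy and quality confidence are computed or how cross-subject arbitration combines them, so this example assumes both scores are already available per face and assumes a softmax over their product as the arbitration rule. The function names `arbitrate` and `scene_prior` and the `temperature` parameter are hypothetical.

```python
import math

def arbitrate(lip_sync_energy, quality_conf, temperature=1.0):
    """Assumed arbitration rule: softmax over (energy * confidence) per face."""
    scores = [e * q / temperature for e, q in zip(lip_sync_energy, quality_conf)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def scene_prior(face_features, weights):
    """Collapse per-face feature vectors into one weighted scene-level vector."""
    dim = len(face_features[0])
    return [sum(w * f[d] for w, f in zip(weights, face_features))
            for d in range(dim)]

# Two faces: the first has strong lip motion, the second is nearly still.
weights = arbitrate([2.0, 0.1], [0.9, 0.9])
prior = scene_prior([[1.0, 0.0], [0.0, 1.0]], weights)
```

Under this assumed rule, a face with strong, reliably tracked lip motion dominates the scene-level prior, and all faces are summarized in a single vector rather than processed separately.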