A method and system for training an active speaker detection model in a noisy environment
By employing an end-to-end multi-task optimization framework and dynamic visual cue aggregation, the method addresses false positives and false negatives in audio-visual active speaker detection in noisy environments, improving both the model's discrimination accuracy under noise and its inference efficiency in multi-face scenarios.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: SHANDONG UNIV
- Filing Date: 2026-03-13
- Publication Date: 2026-06-19
AI Technical Summary
Existing audio-visual active speaker detection methods degrade significantly in noisy environments, producing false positives and false negatives. In scenes with multiple faces they also suffer from redundant computation and are difficult to optimize.
The method constructs an end-to-end multi-task joint optimization framework that combines dynamic visual cue aggregation with a multi-dimensional decoupled audio-visual separation guidance network. The framework learns robust audio representations, achieving denoising and focusing of the visual priors, and thereby reduces inference overhead.
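The idea of multi-task joint optimization can be illustrated with a minimal sketch. The summary does not state the patent's actual loss terms, so everything here is an assumption: a binary cross-entropy detection loss, a mean-squared-error stand-in for the separation branch, and a hypothetical `separation_weight` that balances the two.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy for one speaking / not-speaking prediction."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_squared_error(estimate, target):
    """Assumed stand-in reconstruction loss for the separation branch."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def joint_loss(speak_probs, speak_labels, separated_audio, clean_audio,
               separation_weight=0.5):
    """Weighted sum of a detection loss and a separation loss.

    separation_weight is a hypothetical hyperparameter, not a value
    taken from the patent.
    """
    detection = sum(
        binary_cross_entropy(p, y) for p, y in zip(speak_probs, speak_labels)
    ) / len(speak_labels)
    separation = mean_squared_error(separated_audio, clean_audio)
    total = detection + separation_weight * separation
    return total, detection, separation


# Toy values: two face tracks and a three-sample audio estimate.
total, det, sep = joint_loss(
    speak_probs=[0.9, 0.2],
    speak_labels=[1, 0],
    separated_audio=[0.1, -0.2, 0.3],
    clean_audio=[0.0, -0.2, 0.5],
)
```

Because both terms feed a single scalar, one backward pass in a real framework would update the shared audio encoder from both tasks, which is the usual motivation for training the branches end to end rather than separately.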
The method significantly improves the model's discrimination accuracy and generalization ability in noisy environments, reduces inference complexity in multi-face scenarios, and improves both the accuracy of visual guidance and the stability of model training.
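One way aggregation can lower multi-face inference cost is to pool a variable number of per-face features into a single fixed-size cue, so that downstream audio processing runs once per clip rather than once per face. The sketch below assumes dot-product attention against an audio query; the summary does not describe the patent's actual aggregation rule, and the names `aggregate_visual_cues` and `audio_query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_visual_cues(face_features, audio_query):
    """Pool any number of per-face feature vectors into one vector.

    Each face is scored by its dot product with the audio query
    (an assumed relevance measure), and the scores are turned into
    weights that sum to 1.
    """
    scores = [
        sum(f * q for f, q in zip(face, audio_query))
        for face in face_features
    ]
    weights = softmax(scores)
    dim = len(audio_query)
    pooled = [
        sum(w * face[i] for w, face in zip(weights, face_features))
        for i in range(dim)
    ]
    return pooled, weights


# Three faces with 2-dimensional toy features.
faces = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled, weights = aggregate_visual_cues(faces, audio_query=[2.0, 0.0])
```

The pooled vector has the same size regardless of how many faces appear in the frame, and the face most aligned with the audio query receives the largest weight.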
Figure CN122245324A_ABST