A method and system for training an active speaker detection model in a noisy environment

An end-to-end multi-task optimization framework with dynamic visual cue aggregation reduces false positives and false negatives in audio-visual active speaker detection under noise, improving the model's discrimination accuracy in noisy environments and its inference efficiency in multi-face scenes.

CN122245324A · Pending · Publication Date: 2026-06-19 · SHANDONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: SHANDONG UNIV
Filing Date: 2026-03-13
Publication Date: 2026-06-19

Application Information

IPC: G10L17/04; G10L17/10; G06V40/70; G06V40/16; G06N5/04; G06N3/045; G06N3/0495; G06N3/084; G06N3/0985

AI Tagging

Application Domain

Speech analysis; Biological models

AI Technical Summary

Technical Problem

Existing audio-visual active speaker detection methods degrade significantly in noisy environments, producing false positives and false negatives. In multi-face scenes they also suffer from redundant computation and are difficult to optimize.

Method used

An end-to-end multi-task joint optimization framework combines dynamic visual cue aggregation with a multi-dimensional decoupled audio-visual separation guidance network. The framework learns robust audio representations, denoises and focuses the visual priors, and thereby reduces inference overhead.
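
The summary does not give a formula for dynamic visual cue aggregation. As a rough illustration only, the sketch below pools per-face features into a single scene-level prior, weighting each face by a softmax over its lip-sync energy multiplied by its quality confidence (the two quantities named in the abstract). The function name, the product score, the softmax weighting, and the `temperature` parameter are all assumptions made for this example, not the patented procedure.

```python
import math


def aggregate_visual_cues(face_features, lip_energy, quality_conf, temperature=1.0):
    """Pool per-face feature vectors into one scene-level prior.

    Hypothetical sketch: the scoring rule and softmax weighting are
    assumptions for illustration, not taken from the patent text.
    """
    if not face_features:
        raise ValueError("need at least one face")
    if not (len(face_features) == len(lip_energy) == len(quality_conf)):
        raise ValueError("need one energy and one confidence value per face")

    # Assumed score: lip-sync energy gated by the quality confidence.
    scores = [e * q / temperature for e, q in zip(lip_energy, quality_conf)]

    # Numerically stable softmax across faces (the "arbitration" step here).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]

    # Weighted sum of the per-face features gives one fixed-size prior,
    # regardless of how many faces are in the scene.
    dim = len(face_features[0])
    prior = [sum(w * f[d] for w, f in zip(weights, face_features)) for d in range(dim)]
    return prior, weights


# Two faces: the first has clear lip motion, the second has none.
prior, weights = aggregate_visual_cues(
    face_features=[[1.0, 0.0], [0.0, 1.0]],
    lip_energy=[2.0, 0.0],
    quality_conf=[1.0, 1.0],
)
```

Under these assumptions the downstream network sees one prior vector per scene rather than one per face, which is one plausible reading of how the multi-face inference overhead could be reduced.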

Benefits of technology

The method significantly improves the model's discrimination accuracy and generalization in noisy environments, reduces inference complexity in multi-face scenes, and improves both the accuracy of visual guidance and the stability of training.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122245324A_ABST

Abstract

This invention proposes a training method and system for an active speaker detection (ASD) model in noisy environments, in the interdisciplinary field of speech signal processing and computer vision. Within an end-to-end multi-task framework, audio-visual speech enhancement and separation serve as guiding tasks and are jointly optimized with the active speaker detection task through a shared audio encoder. During training, lip-sync energy and quality confidence are first calculated, and a scene-level visual prior is obtained through cross-subject arbitration. The aligned visual prior is then fused with the complex spectral features of the mixed speech and fed into a multi-dimensional decoupled separation guidance network, which produces enhanced speech and robust audio representations. Finally, the parameters are updated by backpropagating a weighted combination of the separation and detection losses. The invention thus learns, end to end, robust audio representations that are both "clean" and "effective for ASD discrimination" in noisy environments, and it reduces inference overhead in multi-face scenes while denoising and focusing the visual priors.
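
The final training step described above, a weighted combination of the separation and detection losses, can be sketched as follows. The abstract does not name the individual loss functions or the weights, so the mean squared error used for separation, the binary cross-entropy used for detection, and the `w_sep` and `w_asd` values are assumptions chosen only to make the example concrete.

```python
import math


def mse_loss(estimate, target):
    """Assumed stand-in for the separation loss: mean squared error."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def bce_loss(probs, labels, eps=1e-7):
    """Assumed stand-in for the detection loss: binary cross-entropy."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def joint_loss(sep_loss, asd_loss, w_sep=1.0, w_asd=1.0):
    """Weighted sum of the two task losses (weights are hypothetical)."""
    return w_sep * sep_loss + w_asd * asd_loss


# Toy values: an enhanced-speech estimate against a clean target, and
# per-frame speaking probabilities against ground-truth labels.
sep = mse_loss(estimate=[1.0, 2.0], target=[1.0, 4.0])
asd = bce_loss(probs=[0.5], labels=[1])
total = joint_loss(sep, asd, w_sep=0.5, w_asd=1.0)
```

In an automatic-differentiation framework, backpropagating through such a sum scales each task's gradient by its weight, so a shared audio encoder would receive gradient from both the separation and the detection branch.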