Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes.
While humans can easily detect speech by matching lip movements to audio, current ASD models
struggle to establish this correspondence, often misclassifying non-speaking instances when audio
and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker
dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses
on lip movements by integrating lip landmarks during training. Specifically, given a face track, LASER extracts
frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector.
These coordinates are encoded into dense feature maps, providing spatial and structural information
on lip positions. Recognizing that landmark detectors may fail under challenging conditions
(e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align
predictions from both lip-aware and face-only features, ensuring reliable performance even when lip landmarks
are unavailable. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art
models, especially when audio and visual streams are desynchronized, demonstrating its robustness in real-world videos.
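To make the landmark-encoding step concrete, the sketch below rasterizes the 2D lip landmark coordinates into per-landmark Gaussian heatmaps, one plausible way to turn sparse coordinates into the dense feature maps described above. The abstract does not specify LASER's exact encoding, so the Gaussian form, the `sigma` parameter, and the function names here are illustrative assumptions.

```python
# A minimal sketch of one plausible landmark-to-feature-map encoding:
# each 2D lip landmark becomes a Gaussian "heatmap" channel, giving the
# network dense spatial and structural cues about lip position.
# (Assumed for illustration; not necessarily LASER's exact encoding.)
import numpy as np

def landmarks_to_heatmaps(landmarks, height, width, sigma=2.0):
    """Rasterize K (x, y) landmark coordinates into K dense H x W maps."""
    ys, xs = np.mgrid[0:height, 0:width]      # pixel coordinate grids
    maps = np.zeros((len(landmarks), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(landmarks):
        d2 = (xs - x) ** 2 + (ys - y) ** 2    # squared distance to landmark
        maps[k] = np.exp(-d2 / (2.0 * sigma ** 2))
    return maps

# Example: 20 hypothetical lip landmarks on a 112 x 112 face crop.
lip_pts = np.random.rand(20, 2) * 112
feature_maps = landmarks_to_heatmaps(lip_pts, 112, 112)  # (20, 112, 112)
```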
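Similarly, the auxiliary consistency loss can be pictured as pulling the lip-aware and face-only branches' predictions toward each other, so the model degrades gracefully when landmarks are missing. The abstract does not give the loss's concrete form; the sketch below assumes a symmetric KL divergence between the two branches' class distributions, and all names are hypothetical.

```python
# A minimal sketch of the consistency idea: align the prediction made
# with lip features against the prediction made from the face alone.
# (Symmetric KL is an assumption; the paper's exact loss may differ.)
import torch
import torch.nn.functional as F

def consistency_loss(lip_logits, face_logits):
    """Symmetric KL between lip-aware and face-only predictions.

    Both inputs: (batch, num_classes) raw logits, e.g., speaking / silent.
    """
    p = F.log_softmax(lip_logits, dim=-1)   # log-probs, lip-aware branch
    q = F.log_softmax(face_logits, dim=-1)  # log-probs, face-only branch
    # F.kl_div expects log-probs as input and probs as target and
    # computes KL(target || exp(input)); symmetrize over both directions.
    kl_a = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(face || lip)
    kl_b = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(lip || face)
    return 0.5 * (kl_a + kl_b)

# Example usage with hypothetical branch outputs; in practice this term
# would be added to the main ASD loss with some weighting factor.
aux = consistency_loss(torch.randn(8, 2), torch.randn(8, 2))
```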