LASER
Lip Landmark Assisted Speaker dEtection for Robustness
Le Thien Phuc Nguyen*
Zhuoran Yu*
Yong Jae Lee
(* Equal Contribution)
University of Wisconsin - Madison
[Paper]
[GitHub]
Qualitative results of LASER, TalkNCE, and LoCoNet on two unsynchronized videos. We create non-speaking scenarios by swapping the audio tracks of two videos in which the same person is speaking from similar camera angles. A red box indicates that the model predicts not speaking, a green box indicates that the model predicts speaking, and a yellow circle marks a wrong prediction.

Abstract

Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts.
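
Concretely, the consistency loss encourages the face-only pathway to agree with the lip-aware pathway, so the model degrades gracefully when the landmark detector fails. The snippet below is a minimal PyTorch sketch assuming per-frame speaking/not-speaking logits and a KL-divergence distance; the function name and the exact distance are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def consistency_loss(lip_aware_logits: torch.Tensor,
                     face_only_logits: torch.Tensor) -> torch.Tensor:
    # Both inputs: (batch, 2) speaking / not-speaking logits.
    # Treat the lip-aware branch as the target and pull the face-only
    # predictions toward it (the KL formulation is an assumed choice).
    log_p_face = F.log_softmax(face_only_logits, dim=-1)
    p_lip = F.softmax(lip_aware_logits, dim=-1).detach()
    return F.kl_div(log_p_face, p_lip, reduction="batchmean")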


Our Method

  1. Given a face track V, we first obtain a lip landmark track using a facial landmark detector and encode the 2D coordinates of these landmarks into continuous 2D feature maps.
  2. These maps are then aggregated through a 1x1 convolution layer.
  3. The encoded lip track is concatenated with visual features from a 3D CNN and fed into ResNet and V-TCN to capture a temporal visual representation, which is further processed by context modeling modules to produce the final prediction.
For illustration, we use Long-term Intra-speaker Modeling (LIM) and Short-term Inter-speaker Modeling (SIM) from LoCoNet; however, our LASER is not limited to LoCoNet and can be integrated with other models.
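
The sketch below illustrates steps 1-3 in PyTorch: landmark coordinates are rendered as dense 2D maps, aggregated with a 1x1 convolution, and concatenated with the visual features. The shapes, the Gaussian encoding, and the layer sizes are illustrative assumptions rather than the exact LASER implementation.

import torch
import torch.nn as nn

class LipLandmarkEncoder(nn.Module):
    def __init__(self, num_landmarks=20, map_size=112):
        super().__init__()
        self.map_size = map_size
        # Step 2: aggregate the per-landmark maps with a 1x1 convolution.
        self.aggregate = nn.Conv2d(num_landmarks, 1, kernel_size=1)

    def forward(self, coords):
        # coords: (B, T, N, 2) lip landmark coordinates, normalized to [0, 1].
        B, T, N, _ = coords.shape
        # Step 1: encode each landmark as a Gaussian bump on a dense 2D map,
        # one channel per landmark (the Gaussian encoding is an assumption).
        axis = torch.linspace(0, 1, self.map_size, device=coords.device)
        grid_y, grid_x = torch.meshgrid(axis, axis, indexing="ij")
        cx = coords[..., 0].reshape(B * T, N, 1, 1)
        cy = coords[..., 1].reshape(B * T, N, 1, 1)
        dist2 = (grid_x - cx) ** 2 + (grid_y - cy) ** 2
        maps = torch.exp(-dist2 / (2 * 0.02 ** 2))       # (B*T, N, H, W)
        lip_map = self.aggregate(maps)                   # (B*T, 1, H, W)
        return lip_map.view(B, T, 1, self.map_size, self.map_size)

# Step 3 (schematic): concatenate the encoded lip track with the frame-level
# visual features along the channel dimension before the visual backbone:
#   fused = torch.cat([visual_feats, lip_maps], dim=2)   # (B, T, C+1, H, W)
# and feed the result into ResNet + V-TCN, then the context modeling modules.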

Results on unsynchronized videos

We train both LoCoNet and LASER on the synchronized version of the AVA Active Speaker Detection dataset and evaluate them on the unsynchronized versions of AVA, Talkies, and ASW for out-of-distribution (OOD) evaluation.
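
To give a sense of how unsynchronized evaluation clips can be produced, the helper below swaps the audio track of one clip onto another via ffmpeg. The file names are placeholders, and this is a sketch of the audio-swapping protocol described above, not necessarily our exact pipeline.

import subprocess

def swap_audio(video_path: str, audio_source_path: str, out_path: str) -> None:
    # Keep the video stream of the first clip and the audio stream of the
    # second clip, copying both streams without re-encoding.
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", video_path,          # video source
         "-i", audio_source_path,   # audio source
         "-map", "0:v:0", "-map", "1:a:0",
         "-c", "copy", out_path],
        check=True,
    )

swap_audio("clip_A.mp4", "clip_B.mp4", "clip_A_with_B_audio.mp4")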

Bibtex

	
		@misc{nguyen2025laserliplandmarkassisted,
			title={LASER: Lip Landmark Assisted Speaker Detection for Robustness}, 
			author={Le Thien Phuc Nguyen and Zhuoran Yu and Yong Jae Lee},
			year={2025},
			eprint={2501.11899},
			archivePrefix={arXiv},
			primaryClass={cs.CV},
			url={https://arxiv.org/abs/2501.11899}, 
		}
	
	


Acknowledgements

We thank the University of Wisconsin - Madison for providing GPU resources. We also thank Utkarsh Ojha for proofreading our paper. Many thanks to LoCoNet author Xizi Wang and TalkNCE author Chaeyoung Jung for their support and for answering related questions. This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.