UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Le Thien Phuc Nguyen * 1
Zhuoran Yu * 1
Khoa Quang Nhat Cao 1
Yuwei Guo 1
Tu Ho Manh Pham 1
Tuan Tai Nguyen 1
Toan Ngo Duc Vo 1
Lucas Poon 1
Soochahn Lee 2
Yong Jae Lee 1
(* Equal Contribution)
1 University of Wisconsin - Madison   2 Kookmin University
[Paper]
[Code]
[Data]

TL;DR

We introduce UniTalk, a challenging active speaker detection dataset designed to enhance model generalization in diverse real-world scenarios.

Abstract

We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers talking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets such as Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.

Comparison with other benchmarks

Table 1: Quantitative comparison between ASD datasets.
All statistics are computed over the combined training and test sets of each dataset. The highest value in each row is shown in bold.
Statistics                    AVA      ASW      Talkies   UniTalk
Total hours                   38.5     23.0     4.2       44.5
Total face tracks             37,738   8,000    23,508    48,693
Total face crops              3.4M     407K     799K      4M
Average speakers per frame    1.5      1.9      2.3       2.6
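
The statistics in Table 1 can be derived directly from frame-level annotations. The sketch below is a minimal Python example assuming a hypothetical annotation CSV with one row per labeled face crop and columns video_id, frame, track_id, and label (1 when the face is speaking); the released UniTalk format may differ, so treat the schema and the speaking-label convention as assumptions.

import pandas as pd

def dataset_statistics(csv_path: str) -> dict:
    # One row per labeled face crop (hypothetical schema, see note above).
    ann = pd.read_csv(csv_path)
    total_crops = len(ann)                                        # "Total face crops"
    total_tracks = ann.groupby(["video_id", "track_id"]).ngroups  # "Total face tracks"
    # "Average speakers per frame": mean count of speaking faces over all
    # annotated frames (the paper's exact definition may differ).
    speakers_per_frame = (
        ann.assign(speaking=ann["label"] == 1)
           .groupby(["video_id", "frame"])["speaking"]
           .sum()
    )
    return {
        "total_face_crops": total_crops,
        "total_face_tracks": total_tracks,
        "avg_speakers_per_frame": float(speakers_per_frame.mean()),
    }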

Examples


Experimental Results

Leaderboard
Name                Underrepresented Languages   Background Noise   Crowded Scenes   Mixed Condition   Overall
LoCoNet + TalkNCE   86.7                         84.9               84.1             77.9              83.2
LoCoNet             85.8                         84.6               80.0             76.2              82.2
TalkNet             80.1                         77.6               67.1             70.3              75.7
ASC                 74.7                         62.9               53.4             57.3              61.4
ASDNet              30.8                         17.5               14.8             20.3              20.6
If you have results from a new model, please contact us so that we can verify them and update the leaderboard.
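
The leaderboard numbers are most naturally read as frame-level mean average precision, the standard ASD metric on AVA; that, and the idea that each face crop carries a condition tag, are assumptions behind the sketch below rather than a description of the official evaluation code.

import numpy as np
from sklearn.metrics import average_precision_score

def per_category_map(labels, scores, categories):
    # labels: 1/0 ground truth, scores: model confidences, categories: condition
    # tag per face crop (e.g. "crowded_scenes"); all three have equal length.
    labels, scores, categories = map(np.asarray, (labels, scores, categories))
    results = {cat: average_precision_score(labels[categories == cat],
                                            scores[categories == cat])
               for cat in np.unique(categories)}
    # "Overall" here pools all crops; the leaderboard's aggregation may differ.
    results["overall"] = average_precision_score(labels, scores)
    return results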
Dataset Comparison
Train on \ Evaluate on   AVA    Talkies   ASW    UniTalk
AVA                      95.5   88.3      88.5   77.5
Talkies                  55.7   95.6      84.5   59.9
ASW                      29.2   58.8      96.1   33.8
UniTalk                  88.0   91.4      90.4   83.2
For the cross-dataset comparison, we use the current state-of-the-art model, LoCoNet + TalkNCE, training it on each row's dataset and evaluating it on each column's dataset.
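
A minimal sketch of this protocol (train once per dataset, then score every evaluation split) is given below; train_fn and evaluate_fn are placeholders, not the released training code.

DATASETS = ["AVA", "Talkies", "ASW", "UniTalk"]

def cross_dataset_matrix(train_fn, evaluate_fn):
    # Returns {train_set: {eval_set: score}}, matching the table layout above.
    results = {}
    for train_set in DATASETS:
        model = train_fn(train_set)  # e.g. LoCoNet + TalkNCE trained on train_set
        results[train_set] = {eval_set: evaluate_fn(model, eval_set)
                              for eval_set in DATASETS}
    return results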

Paper and Supplementary Material

Le Thien Phuc Nguyen*, Zhuoran Yu*, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee (* Equal Contribution)
UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
(hosted on arXiv)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.