UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Le Thien Phuc Nguyen * 1
Zhuoran Yu * 1
Khoa Quang Nhat Cao 1
Yuwei Guo 1
Tu Ho Manh Pham 1
Tuan Tai Nguyen 1
Toan Ngo Duc Vo 1
Lucas Poon 1
Soochahn Lee 2
Yong Jae Lee 1
(* Equal Contribution)
1 University of Wisconsin - Madison   2 Kookmin University
[Paper]
[Code]
[Data]

TL;DR

We introduce UniTalk, a challenging active speaker detection dataset designed to enhance model generalization in diverse real-world scenarios.

Abstract

We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers talking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets such as Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.

Comparison with other benchmarks

Table 1: Quantitative comparison between ASD datasets.
All statistics are computed over the combined training and test sets of each dataset. The highest value in each row is shown in bold.
Statistics                    AVA      ASW      Talkies   UniTalk
Total hours                   38.5     23.0     4.2       44.5
Total face tracks             37,738   8,000    23,508    48,693
Total face crops              3.4M     407K     799K      4M
Average speakers per frame    1.5      1.9      2.3       2.6
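
The statistics in Table 1 can be derived directly from frame-level annotations. The sketch below is a minimal Python example assuming a hypothetical annotation CSV with one row per labeled face crop and columns video_id, frame, track_id, and label (1 when the face is speaking); the released UniTalk format may differ, so treat the schema and the speaking-label convention as assumptions.

import pandas as pd

def dataset_statistics(csv_path: str) -> dict:
    # One row per labeled face crop (hypothetical schema, see note above).
    ann = pd.read_csv(csv_path)
    total_crops = len(ann)                                        # "Total face crops"
    total_tracks = ann.groupby(["video_id", "track_id"]).ngroups  # "Total face tracks"
    # "Average speakers per frame": mean count of speaking faces over all
    # annotated frames (the paper's exact definition may differ).
    speakers_per_frame = (
        ann.assign(speaking=ann["label"] == 1)
           .groupby(["video_id", "frame"])["speaking"]
           .sum()
    )
    return {
        "total_face_crops": total_crops,
        "total_face_tracks": total_tracks,
        "avg_speakers_per_frame": float(speakers_per_frame.mean()),
    }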

Examples


Experimental Results

Leaderboard
Name                Underrepresented Languages   Background Noise   Crowded Scenes   Mixed Condition   Overall
LoCoNet + TalkNCE   86.7                         84.9               84.1             77.9              83.2
LoCoNet             85.8                         84.6               80.0             76.2              82.2
TalkNet             80.1                         77.6               67.1             70.3              75.7
ASC                 74.7                         62.9               53.4             57.3              61.4
ASDNet              30.8                         17.5               14.8             20.3              20.6
If you have results from a new model, please contact us so that we can verify them and update the leaderboard.
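
The leaderboard numbers are most naturally read as frame-level mean average precision, the standard ASD metric on AVA; that, and the idea that each face crop carries a condition tag, are assumptions behind the sketch below rather than a description of the official evaluation code.

import numpy as np
from sklearn.metrics import average_precision_score

def per_category_map(labels, scores, categories):
    # labels: 1/0 ground truth, scores: model confidences, categories: condition
    # tag per face crop (e.g. "crowded_scenes"); all three have equal length.
    labels, scores, categories = map(np.asarray, (labels, scores, categories))
    results = {cat: average_precision_score(labels[categories == cat],
                                            scores[categories == cat])
               for cat in np.unique(categories)}
    # "Overall" here pools all crops; the leaderboard's aggregation may differ.
    results["overall"] = average_precision_score(labels, scores)
    return results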
Dataset Comparison
Train on \ Evaluate on   AVA    Talkies   ASW    UniTalk
AVA                      95.5   88.3      88.5   77.5
Talkies                  55.7   95.6      84.5   59.9
ASW                      29.2   58.8      96.1   33.8
UniTalk                  88.0   91.4      90.4   83.2
For the cross-dataset comparison, we use the current state-of-the-art model, LoCoNet + TalkNCE, training it on each row's dataset and evaluating it on each column's dataset.
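
A minimal sketch of this protocol (train once per dataset, then score every evaluation split) is given below; train_fn and evaluate_fn are placeholders, not the released training code.

DATASETS = ["AVA", "Talkies", "ASW", "UniTalk"]

def cross_dataset_matrix(train_fn, evaluate_fn):
    # Returns {train_set: {eval_set: score}}, matching the table layout above.
    results = {}
    for train_set in DATASETS:
        model = train_fn(train_set)  # e.g. LoCoNet + TalkNCE trained on train_set
        results[train_set] = {eval_set: evaluate_fn(model, eval_set)
                              for eval_set in DATASETS}
    return results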

Paper and Supplementary Material

Le Thien Phuc Nguyen*, Zhuoran Yu*, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee (* Equal Contribution)
UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
(hosted on arXiv)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.