We present UniTalk, a novel dataset designed specifically for active speaker
detection (ASD), emphasizing challenging scenarios to enhance model generalization.
Unlike previously established benchmarks such as AVA, which consists predominantly of old movies
and therefore exhibits a significant domain gap, UniTalk focuses explicitly on diverse and difficult
real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded
scenes in which multiple visible speakers talk concurrently or in overlapping turns. It
contains over 44.5 hours of video with frame-level active speaker annotations across 48,693
speaking identities, spanning a broad range of real-world video types.
Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect
scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from
solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger
generalization to modern “in-the-wild” datasets such as Talkies and ASW, as well as to AVA. UniTalk
thus establishes a new benchmark for active speaker detection, providing researchers with a
valuable resource for developing and evaluating versatile and resilient models.