AV-SpeakerBench is a curated benchmark of 3,212 multiple-choice questions that tests speaker-centric audiovisual reasoning in real-world videos. Unlike prior video benchmarks, in which many tasks are visually solvable or only loosely tied to speech, AV-SpeakerBench explicitly evaluates whether models can align who speaks, what is said, and when it happens. Questions are written with fusion-grounded semantics (audio–visual anchors) and expert-curated annotations to ensure temporal and cross-modal correctness. Current results show the Gemini family leading, with Gemini 3 Pro (Thinking) on top overall; the gap to strong open models such as Qwen3-Omni 30B highlights persistent weaknesses in audiovisual fusion.
If you have results to report, please contact us so that we can verify them and update the leaderboard.
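For reference, evaluation is standard multiple-choice accuracy per task. Below is a minimal scoring sketch; the field names and file layouts (`question_id`, `answer`, `category`, a JSON list of annotations, and a JSON dict of predictions) are assumptions for illustration, not the benchmark's actual release schema.

```python
# Minimal MCQ scoring sketch. The field names ("question_id", "answer",
# "category") and file layouts below are assumptions for illustration,
# not the benchmark's actual release format.
import json
from collections import defaultdict

def score(annotations_path: str, predictions_path: str) -> dict:
    """Return overall and per-category accuracy (%) for MCQ predictions.

    annotations_path: JSON list of {"question_id", "answer", "category"}.
    predictions_path: JSON dict mapping question_id -> predicted letter.
    """
    with open(annotations_path) as f:
        annotations = json.load(f)
    with open(predictions_path) as f:
        predictions = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for item in annotations:
        total[item["category"]] += 1
        # Missing predictions count as wrong (strict MCQ scoring).
        pred = predictions.get(item["question_id"], "").strip().upper()
        if pred == item["answer"].strip().upper():
            correct[item["category"]] += 1

    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return {"overall": overall, "per_category": per_category}
```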
| Model | Type | Overall | Speaker Det. | Speaker Rec. | Speaker Cnt. | Attr. Rec. | Activity Rec. | Visual Cnt. | Speech Rec. | Speech Dur. | Speech Pitch | Speech Rate | Speech Intensity | Audio Cnt. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human performance | Human | 93.74 | 96.02 | 93.13 | 94.28 | 93.14 | 93.20 | 94.15 | 96.52 | 90.68 | 93.20 | 91.39 | 94.17 | 93.40 |
| Gemini 3 Pro (Thinking) | Proprietary | 77.62 | 85.95 | 79.86 | 73.13 | 80.39 | 79.13 | 71.71 | 87.06 | 75.85 | 70.87 | 78.47 | 75.24 | 70.14 |
| Gemini 2.5 Pro (Thinking) | Proprietary | 73.04 | 81.73 | 74.15 | 74.13 | 72.55 | 73.30 | 62.93 | 77.11 | 78.81 | 67.48 | 69.86 | 71.84 | 63.89 |
| Gemini 2.5 Flash (Thinking) | Proprietary | 67.84 | 74.71 | 70.62 | 60.95 | 65.59 | 70.39 | 65.85 | 78.61 | 69.92 | 66.50 | 67.46 | 65.05 | 58.33 |
| Gemini 2.5 Flash | Proprietary | 60.27 | 69.79 | 68.24 | 50.75 | 61.76 | 68.45 | 51.22 | 71.64 | 58.47 | 60.68 | 60.29 | 59.71 | 40.97 |
| Qwen3-Omni 30B | Open omni | 54.14 | 61.83 | 54.74 | 46.77 | 56.86 | 58.74 | 40.49 | 68.16 | 59.32 | 58.74 | 55.98 | 66.02 | 34.72 |
| Gemini 2.0 Flash | Proprietary | 53.21 | 60.19 | 63.51 | 47.51 | 54.90 | 63.59 | 45.85 | 71.14 | 56.78 | 62.62 | 55.98 | 55.83 | 41.67 |
| Gemini 2.0 Flash-Lite | Proprietary | 51.43 | 56.91 | 57.35 | 38.81 | 49.02 | 55.83 | 45.85 | 67.66 | 52.12 | 52.43 | 51.12 | 56.31 | 33.33 |
| Gemini 2.5 Flash-Lite | Proprietary | 47.23 | 45.90 | 52.44 | 39.10 | 48.53 | 51.94 | 43.41 | 55.72 | 51.69 | 51.46 | 46.89 | 54.37 | 36.11 |
| Qwen2.5-Omni 7B | Open omni | 42.31 | 47.54 | 41.23 | 34.83 | 42.65 | 38.83 | 43.41 | 53.23 | 47.03 | 51.94 | 42.11 | 43.20 | 29.17 |
| Phi-4 Multimodal 5.6B | Open omni | 38.45 | 37.70 | 41.71 | 28.02 | 46.95 | 33.50 | 38.05 | 45.77 | 38.56 | 49.03 | 42.58 | 37.38 | 26.04 |
| Qwen2.5-Omni 3B | Open omni | 38.23 | 44.91 | 41.23 | 33.83 | 44.61 | 45.63 | 44.88 | 45.77 | 50.00 | 42.23 | 32.54 | 39.32 | 26.39 |
| Video-LLaMA2 7B | Open AV | 37.67 | 34.19 | 36.02 | 31.25 | 35.29 | 37.38 | 41.46 | 29.85 | 40.25 | 49.03 | 44.02 | 45.15 | 31.25 |
| VITA-1.5 7B | Open AV | 36.27 | 32.08 | 36.73 | 32.59 | 38.24 | 35.44 | 35.12 | 32.34 | 45.34 | 43.20 | 44.98 | 36.89 | 29.51 |
| VITA 7B | Open AV | 33.66 | 32.08 | 34.60 | 29.85 | 37.25 | 36.41 | 28.78 | 32.84 | 37.29 | 38.83 | 35.89 | 37.86 | 28.13 |
| Video-LLaMA 13B | Open AV | 29.11 | 30.91 | 29.86 | 25.37 | 27.23 | 37.38 | 29.76 | 27.86 | 23.31 | 27.67 | 33.49 | 30.58 | 27.08 |
| Video-LLaMA 7B | Open AV | 28.21 | 29.51 | 26.07 | 29.35 | 27.23 | 32.52 | 25.37 | 31.84 | 28.81 | 31.55 | 28.23 | 27.67 | 21.53 |
| Unified-IO 2 xl 3B | Open omni | 27.52 | 28.81 | 28.44 | 31.25 | 30.39 | 26.21 | 23.41 | 24.88 | 30.93 | 22.82 | 26.32 | 30.58 | 31.25 |
| Unified-IO 2 large 1B | Open omni | 26.15 | 24.36 | 27.49 | 23.88 | 24.51 | 23.79 | 22.93 | 21.89 | 22.03 | 25.24 | 31.58 | 31.55 | 34.38 |
| Unified-IO 2 xxl 7B | Open omni | 24.97 | 26.70 | 27.25 | 24.63 | 30.39 | 22.82 | 29.76 | 23.28 | 35.59 | 25.24 | 33.49 | 37.38 | 28.80 |
| OneLLM 7B | Open omni | 24.97 | 30.44 | 23.70 | 20.65 | 30.88 | 25.24 | 27.80 | 25.37 | 1.69 | 24.27 | 28.23 | 25.73 | 34.72 |
| PandaGPT 7B | Open AV | 22.88 | 28.34 | 20.61 | 27.36 | 16.18 | 20.87 | 32.20 | 33.83 | 19.07 | 24.27 | 17.70 | 14.56 | 15.63 |
| PandaGPT 13B | Open AV | 18.37 | 20.61 | 16.82 | 28.36 | 0.98 | 1.94 | 25.37 | 5.97 | 22.46 | 22.82 | 25.84 | 23.30 | 15.63 |
| AnyGPT 7B | Open omni | 12.67 | 24.12 | 4.27 | 24.88 | 0.98 | 19.90 | 19.51 | 15.42 | 0.00 | 0.00 | 0.48 | 2.43 | 22.92 |
All numbers are accuracy (%) from the latest snapshot; rows are ordered by the Overall column. Abbreviations: Det. = Detection, Rec. = Recognition, Cnt. = Counting, Attr. = Attribute, Dur. = Duration.
*Figure: (Left) Visual-only vs. audio gains per task. (Right) Error breakdown by category.*
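One plausible reading of the left panel, as a sketch: the audio gain for a task is the accuracy delta between a full audiovisual run and a visual-only ablation (audio muted). The function below assumes per-task accuracy dictionaries as inputs, and the values in the usage example are illustrative only, not benchmark results.

```python
# Sketch of the "audio gain" analysis in the figure caption, assuming the
# gain is accuracy(full audiovisual) - accuracy(visual-only, audio muted),
# in percentage points, computed per task.
def audio_gain(av_acc: dict, visual_only_acc: dict) -> dict:
    """Per-task accuracy gain from adding the audio stream."""
    return {task: av_acc[task] - visual_only_acc[task] for task in av_acc}

# Illustrative values only (not benchmark results): a large positive gain
# suggests the task genuinely requires audio rather than being visually
# solvable.
print(audio_gain({"Speech Rec.": 80.0, "Visual Cnt.": 60.0},
                 {"Speech Rec.": 30.0, "Visual Cnt.": 58.0}))
# -> {'Speech Rec.': 50.0, 'Visual Cnt.': 2.0}
```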
Accuracy (%) by number of speakers per video:

| Model | Type | ≤2 | 3 | 4 | ≥5 |
|---|---|---|---|---|---|
| Gemini 2.5 Pro (Thinking) | Proprietary | 74.8 | 74.1 | 74.1 | 70.9 |
| Gemini 2.5 Flash (Thinking) | Proprietary | 71.8 | 68.4 | 66.8 | 65.1 |
| Qwen3-Omni 30B | Open omni | 58.3 | 52.9 | 52.0 | 54.4 |
| Qwen2.5-Omni 7B | Open omni | 46.7 | 40.7 | 42.4 | 40.2 |
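Reading the bucket columns as the number of speakers per clip (our interpretation of the headers), this breakdown can be reproduced from per-question metadata roughly as follows; the `num_speakers` field is hypothetical, and the annotation/prediction layouts match the scoring sketch above.

```python
# Sketch of the speaker-count breakdown above, assuming each annotation
# carries a hypothetical "num_speakers" field; reuses the annotation /
# prediction layouts from the earlier scoring sketch.
def bucket(num_speakers: int) -> str:
    """Map a speaker count onto the table's buckets: <=2, 3, 4, >=5."""
    if num_speakers <= 2:
        return "<=2"
    if num_speakers >= 5:
        return ">=5"
    return str(num_speakers)

def accuracy_by_speaker_count(annotations: list, predictions: dict) -> dict:
    """Return accuracy (%) per speaker-count bucket."""
    correct, total = {}, {}
    for item in annotations:
        b = bucket(item["num_speakers"])
        total[b] = total.get(b, 0) + 1
        if predictions.get(item["question_id"]) == item["answer"]:
            correct[b] = correct.get(b, 0) + 1
    return {b: 100.0 * correct.get(b, 0) / total[b] for b in total}
```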
If you find AV-SpeakerBench useful, please cite:

@misc{nguyen2025seehearunderstandbenchmarking,
  title={See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models},
  author={Le Thien Phuc Nguyen and Zhuoran Yu and Samuel Low Yu Hang and Subin An and Jeongik Lee and Yohan Ban and SeungEun Chung and Thanh-Huy Nguyen and JuWan Maeng and Soochahn Lee and Yong Jae Lee},
  year={2025},
  eprint={2512.02231},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.02231},
}
This dataset is released under the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) license. Usage of this dataset requires proper attribution and is restricted to non-commercial purposes.