See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen*, Zhuoran Yu*, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
(* equal contribution)

Introduction

AV-SpeakerBench is a curated benchmark of 3,212 multiple-choice questions that tests speaker-centric audiovisual reasoning in real-world videos. Unlike prior video datasets, where many tasks are visually solvable or only loosely tied to speech, AV-SpeakerBench explicitly evaluates whether models can align who speaks, what is said, and when it happens. Questions are written with fusion-grounded semantics (audio–visual anchors) and expert-curated annotations to ensure temporal and cross-modal correctness. Initial results show that proprietary Gemini models lead overall (Gemini 3 Pro in the latest snapshot), and the gap between them and strong open models such as Qwen3-Omni 30B highlights persistent weaknesses in audiovisual fusion.
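To make the evaluation protocol concrete, here is a minimal scoring sketch in Python. The record fields (`video`, `category`, `question`, `options`, `answer`) and the `model_answer` callable are illustrative assumptions, not the benchmark's actual schema or API; the sketch only shows how overall and per-sub-category MCQ accuracy, as reported in the leaderboard below, would be computed.

```python
from collections import defaultdict

# Hypothetical record layout; the real AV-SpeakerBench schema may differ.
items = [
    {
        "video": "clip_0001.mp4",            # audiovisual input clip
        "category": "Speaker Det.",          # one of the 12 sub-categories
        "question": "Who is speaking between 0:03 and 0:05?",
        "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
        "answer": "B",                       # ground-truth option letter
    },
    # ... 3,212 items in total
]

def evaluate(items, model_answer):
    """Compute overall and per-category MCQ accuracy.

    `model_answer` is any callable mapping an item to an option letter;
    it stands in for a call to a multimodal LLM.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        correct[item["category"]] += int(model_answer(item) == item["answer"])
    overall = sum(correct.values()) / sum(total.values())
    per_category = {c: correct[c] / total[c] for c in total}
    return overall, per_category
```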

Peek at the data

Figures: AV-SpeakerBench question design and dataset statistics.

Leaderboard (MCQ accuracy per sub-category, %)

If you have results to report, please contact us so that we can verify them and update the leaderboard.

| Model | Type | Overall | Speaker Det. | Speaker Rec. | Speaker Cnt. | Attr. Rec. | Activity Rec. | Visual Cnt. | Speech Rec. | Speech Dur. | Speech Pitch | Speech Rate | Speech Intensity | Audio Cnt. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human performance | Human | 93.74 | 96.02 | 93.13 | 94.28 | 93.14 | 93.20 | 94.15 | 96.52 | 90.68 | 93.20 | 91.39 | 94.17 | 93.40 |
| Gemini 3 Pro (Thinking) | Proprietary | 77.62 | 85.95 | 79.86 | 73.13 | 80.39 | 79.13 | 71.71 | 87.06 | 75.85 | 70.87 | 78.47 | 75.24 | 70.14 |
| Gemini 2.5 Pro (Thinking) | Proprietary | 73.04 | 81.73 | 74.15 | 74.13 | 72.55 | 73.30 | 62.93 | 77.11 | 78.81 | 67.48 | 69.86 | 71.84 | 63.89 |
| Gemini 2.5 Flash (Thinking) | Proprietary | 67.84 | 74.71 | 70.62 | 60.95 | 65.59 | 70.39 | 65.85 | 78.61 | 69.92 | 66.50 | 67.46 | 65.05 | 58.33 |
| Gemini 2.5 Flash | Proprietary | 60.27 | 69.79 | 68.24 | 50.75 | 61.76 | 68.45 | 51.22 | 71.64 | 58.47 | 60.68 | 60.29 | 59.71 | 40.97 |
| Qwen3-Omni 30B | Open omni | 54.14 | 61.83 | 54.74 | 46.77 | 56.86 | 58.74 | 40.49 | 68.16 | 59.32 | 58.74 | 55.98 | 66.02 | 34.72 |
| Gemini 2.0 Flash | Proprietary | 53.21 | 60.19 | 63.51 | 47.51 | 54.90 | 63.59 | 45.85 | 71.14 | 56.78 | 62.62 | 55.98 | 55.83 | 41.67 |
| Gemini 2.0 Flash-Lite | Proprietary | 51.43 | 56.91 | 57.35 | 38.81 | 49.02 | 55.83 | 45.85 | 67.66 | 52.12 | 52.43 | 51.12 | 56.31 | 33.33 |
| Gemini 2.5 Flash-Lite | Proprietary | 47.23 | 45.90 | 52.44 | 39.10 | 48.53 | 51.94 | 43.41 | 55.72 | 51.69 | 51.46 | 46.89 | 54.37 | 36.11 |
| Qwen2.5-Omni 7B | Open omni | 42.31 | 47.54 | 41.23 | 34.83 | 42.65 | 38.83 | 43.41 | 53.23 | 47.03 | 51.94 | 42.11 | 43.20 | 29.17 |
| Phi-4 Multimodal 5.6B | Open omni | 38.45 | 37.70 | 41.71 | 28.02 | 46.95 | 33.50 | 38.05 | 45.77 | 38.56 | 49.03 | 42.58 | 37.38 | 26.04 |
| Qwen2.5-Omni 3B | Open omni | 38.23 | 44.91 | 41.23 | 33.83 | 44.61 | 45.63 | 44.88 | 45.77 | 50.00 | 42.23 | 32.54 | 39.32 | 26.39 |
| Video-LLaMA2 7B | Open AV | 37.67 | 34.19 | 36.02 | 31.25 | 35.29 | 37.38 | 41.46 | 29.85 | 40.25 | 49.03 | 44.02 | 45.15 | 31.25 |
| VITA-1.5 7B | Open AV | 36.27 | 32.08 | 36.73 | 32.59 | 38.24 | 35.44 | 35.12 | 32.34 | 45.34 | 43.20 | 44.98 | 36.89 | 29.51 |
| VITA 7B | Open AV | 33.66 | 32.08 | 34.60 | 29.85 | 37.25 | 36.41 | 28.78 | 32.84 | 37.29 | 38.83 | 35.89 | 37.86 | 28.13 |
| Video-LLaMA 13B | Open AV | 29.11 | 30.91 | 29.86 | 25.37 | 27.23 | 37.38 | 29.76 | 27.86 | 23.31 | 27.67 | 33.49 | 30.58 | 27.08 |
| Video-LLaMA 7B | Open AV | 28.21 | 29.51 | 26.07 | 29.35 | 27.23 | 32.52 | 25.37 | 31.84 | 28.81 | 31.55 | 28.23 | 27.67 | 21.53 |
| Unified-IO 2 xl 3B | Open omni | 27.52 | 28.81 | 28.44 | 31.25 | 30.39 | 26.21 | 23.41 | 24.88 | 30.93 | 22.82 | 26.32 | 30.58 | 31.25 |
| Unified-IO 2 large 1B | Open omni | 26.15 | 24.36 | 27.49 | 23.88 | 24.51 | 23.79 | 22.93 | 21.89 | 22.03 | 25.24 | 31.58 | 31.55 | 34.38 |
| Unified-IO 2 xxl 7B | Open omni | 24.97 | 26.70 | 27.25 | 24.63 | 30.39 | 22.82 | 29.76 | 23.28 | 35.59 | 25.24 | 33.49 | 37.38 | 28.80 |
| OneLLM 7B | Open omni | 24.97 | 30.44 | 23.70 | 20.65 | 30.88 | 25.24 | 27.80 | 25.37 | 1.69 | 24.27 | 28.23 | 25.73 | 34.72 |
| PandaGPT 7B | Open AV | 22.88 | 28.34 | 20.61 | 27.36 | 16.18 | 20.87 | 32.20 | 33.83 | 19.07 | 24.27 | 17.70 | 14.56 | 15.63 |
| PandaGPT 13B | Open AV | 18.37 | 20.61 | 16.82 | 28.36 | 0.98 | 1.94 | 25.37 | 5.97 | 22.46 | 22.82 | 25.84 | 23.30 | 15.63 |
| AnyGPT 7B | Open omni | 12.67 | 24.12 | 4.27 | 24.88 | 0.98 | 19.90 | 19.51 | 15.42 | 0.00 | 0.00 | 0.48 | 2.43 | 22.92 |

Numbers reflect the latest snapshot; rows are ordered by Overall accuracy.

Ablations & Error Analysis

Figures: (Left) Modality ablation across task types: visual-only performance vs. gains from adding audio, per task. (Right) Error type distribution across benchmark categories.

Performance vs. Number of Visible People (MCQ accuracy, %)

| Model | Type | ≤2 | 3 | 4 | ≥5 |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Pro (Thinking) | Proprietary | 74.8 | 74.1 | 74.1 | 70.9 |
| Gemini 2.5 Flash (Thinking) | Proprietary | 71.8 | 68.4 | 66.8 | 65.1 |
| Qwen3-Omni 30B | Open omni | 58.3 | 52.9 | 52.0 | 54.4 |
| Qwen2.5-Omni 7B | Open omni | 46.7 | 40.7 | 42.4 | 40.2 |
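
The bucketed numbers above can be reproduced with the same grouping pattern as the scoring sketch earlier, keyed on how many people are visible in each clip. The `num_visible_people` field is an assumed annotation name used only for illustration; `items` and `model_answer` are reused from the earlier sketch.

```python
from collections import defaultdict

def bucket(n: int) -> str:
    """Map a visible-people count onto the table's four buckets."""
    if n <= 2:
        return "<=2"
    if n >= 5:
        return ">=5"
    return str(n)  # 3 or 4

def accuracy_by_visible_people(items, model_answer):
    """MCQ accuracy per visible-people bucket (cf. the table above)."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        b = bucket(item["num_visible_people"])  # assumed annotation field
        total[b] += 1
        correct[b] += int(model_answer(item) == item["answer"])
    return {b: correct[b] / total[b] for b in total}
```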

Citation

@misc{nguyen2025seehearunderstandbenchmarking,
      title={See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models}, 
      author={Le Thien Phuc Nguyen and Zhuoran Yu and Samuel Low Yu Hang and Subin An and Jeongik Lee and Yohan Ban and SeungEun Chung and Thanh-Huy Nguyen and JuWan Maeng and Soochahn Lee and Yong Jae Lee},
      year={2025},
      eprint={2512.02231},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.02231}, 
}

License

This dataset is released under the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) license. Usage of this dataset requires proper attribution and is restricted to non-commercial purposes.