InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies & TV Shows

The InfiniBench skill set comprises eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.

Abstract

Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. We introduce InfiniBench, a comprehensive benchmark designed to push the limits of extremely long video understanding. InfiniBench offers: 1) the longest total video duration, exceeding 1,000 hours, with an average of 52.59 minutes per video; 2) the largest number of question-answer pairs, totaling 111.82 K; 3) grounding and reasoning questions that require models to retrieve, structure, and interpret complex video content while establishing causal relationships; and 4) diverse question types spanning eight distinct skills in both multiple-choice and open-ended formats. We comprehensively evaluate state-of-the-art large multi-modal models on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as recent open-source models. The results highlight significant challenges, with even the leading models struggling to achieve high performance: GPT-4o and Gemini 1.5 Flash attained average accuracies of just 49.34% and 41.99%, and average scores of 3.25 and 2.79 out of 5, respectively. Qwen2.5VL is the strongest open-source model, approaching Gemini 1.5 Flash in performance.
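As a rough illustration of how these two aggregate numbers can be computed, the sketch below scores multiple-choice questions by exact-match accuracy and averages a 0-5 judge score for open-ended questions. The result schema and field names are assumptions for illustration, not InfiniBench's official evaluation code.

```python
# Minimal sketch (not the official evaluation code) of the two aggregate metrics:
# accuracy for multiple-choice (grounding) questions and a mean 0-5 judge score
# for open-ended (reasoning) questions. The per-question fields ("type",
# "prediction", "answer", "judge_score") are illustrative assumptions.
from statistics import mean

def aggregate(results):
    """results: list of dicts, one per question."""
    mcq = [r for r in results if r["type"] == "mcq"]
    open_ended = [r for r in results if r["type"] == "open"]

    # Accuracy (%) over multiple-choice questions.
    avg_acc = 100.0 * mean(r["prediction"] == r["answer"] for r in mcq) if mcq else None

    # Mean judge score (0-5) over open-ended questions.
    avg_score = mean(r["judge_score"] for r in open_ended) if open_ended else None
    return avg_acc, avg_score

# Example usage with toy data:
toy = [
    {"type": "mcq", "prediction": "B", "answer": "B"},
    {"type": "mcq", "prediction": "A", "answer": "C"},
    {"type": "open", "judge_score": 3.5},
]
print(aggregate(toy))  # (50.0, 3.5)
```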

InfiniBench Leaderboard

| Model | Frame Rate | Global Appearance | Scene Transitions | Character Actions | Chronological Understanding | Summarization | Deep Context Understanding | Spoiler Understanding | Linking Events | Avg. Acc. | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (Random) | -- | 16.68 | 16.66 | 16.14 | 41.51 | -- | -- | -- | -- | 22.20 | -- |
| GPT-4o | 250 FPV | 44.51 | 47.93 | 36.07 | 68.85 | 3.49 | 3.39 | 2.67 | 3.45 | 49.34 | 3.25 |
| Gemini-1.5-Flash | -- | 42.10 | 31.63 | 37.82 | 56.41 | 3.24 | 2.55 | 2.05 | 3.33 | 41.99 | 2.79 |
| Qwen2.5VL | 250 FPV | 34.99 | 36.45 | 35.09 | 51.57 | 1.26 | 2.35 | 1.73 | 3.15 | 39.53 | 2.12 |
| Qwen2VL | 250 FPV | 29.99 | 37.54 | 36.86 | 50.85 | 0.67 | 2.07 | 1.41 | 2.76 | 38.81 | 1.73 |
| LongVU | 250 FPV | 38.46 | 22.69 | 28.97 | 45.13 | 0.20 | 1.10 | 0.71 | 1.37 | 33.81 | 0.84 |
| LLaVA-OneVision | 128 FPV | 33.00 | 25.02 | 24.83 | 45.91 | 0.49 | 1.78 | 1.30 | 2.51 | 32.19 | 1.52 |
| InternLM-XComposer-2.5-OL | 16 FPW | 27.17 | 24.37 | 30.09 | 46.68 | 0.37 | 1.21 | 0.61 | 2.03 | 32.08 | 1.06 |
| InternVL2.5 | 128 FPV | 29.84 | 25.35 | 26.41 | 45.58 | 0.65 | 1.48 | 1.06 | 2.22 | 31.80 | 1.35 |
| InternVL2 | 128 FPV | 24.60 | 21.98 | 25.00 | 44.63 | 0.69 | 1.68 | 1.25 | 2.47 | 29.05 | 1.52 |
| LLaMA-VID | 1 FPS | 17.37 | 17.06 | 18.25 | 41.74 | 1.58 | 2.00 | 1.49 | 2.40 | 23.61 | 1.87 |
| Goldfish | 45 FPW | 10.30 | 2.82 | 20.87 | 40.14 | 0.77 | 2.36 | 1.85 | 3.01 | 18.53 | 2.00 |
| MiniGPT4-video | 45 FPV | 2.33 | 1.09 | 2.36 | 39.86 | 0.05 | 0.54 | 0.75 | 0.89 | 11.41 | 0.56 |

InfiniBench leaderboard across the eight skills. The first four skills (Global Appearance, Scene Transitions, Character Actions, Chronological Understanding) are grounding skills evaluated as multiple-choice accuracy (%), while the last four (Summarization, Deep Context Understanding, Spoiler Understanding, Linking Events) are reasoning skills with open-ended answers scored out of 5 by GPT-4o. Frame rates are reported as FPV (frames per video), FPS (frames per second), or FPW (frames per window). All models in this evaluation utilize subtitles.
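The frame-rate column only states how many frames each model receives. As a point of reference, a fixed frames-per-video budget is typically met by sampling frames uniformly across the whole video, whereas an FPS setting samples in proportion to duration. The sketch below illustrates both modes with OpenCV; it is an assumption about typical preprocessing, not the exact sampling code used by any model in the table.

```python
# Hedged sketch: uniform frame sampling for a fixed frames-per-video (FPV)
# budget vs. a fixed frames-per-second (FPS) rate. Illustrative preprocessing
# only, not InfiniBench's actual pipeline.
import cv2

def sample_frames(video_path, fpv=None, fps=None):
    assert (fpv is None) != (fps is None), "specify exactly one of fpv or fps"
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0

    if fpv is not None:
        # FPV: spread a fixed number of frames evenly over the whole video.
        indices = [int(i * total / fpv) for i in range(min(fpv, total))]
    else:
        # FPS: take roughly `fps` frames for every second of footage.
        step = max(int(native_fps / fps), 1)
        indices = list(range(0, total, step))

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# e.g. 250 frames per video, as used by several models in the table above:
# frames = sample_frames("episode.mp4", fpv=250)
```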



InfiniBench vs. existing video understanding benchmarks.

| Category | Benchmark | # Questions | # Videos | Avg. Video Duration (mins) | Total Video Duration (hours) |
|---|---|---|---|---|---|
| Short | TGIF-QA | 8.5 K | 9575 | 0.05 | 7.98 |
| Short | MSRVTT-QA | 72.8 K | 2990 | 0.25 | 12.45 |
| Short | MV-Bench | 4.0 K | 3641 | 0.27 | 16.38 |
| Long | Activity-QA | 8.0 K | 800 | 1.85 | 24.67 |
| Long | TVQA | 15.2 K | 2179 | 1.86 | 67.55 |
| Long | EgoSchema | 5.0 K | 5063 | 3.00 | 253.15 |
| Long | LongVideoBench | 6.7 K | 3763 | 7.88 | 494.21 |
| Long | MovieChat | 13.0 K | 1000 | 9.40 | 156.67 |
| Long | MLVU | 3.1 K | 1730 | 15.50 | 446.92 |
| Long | MoVQA | 21.9 K | 100 | 16.53 | 27.55 |
| Long | Video-MME | 2.7 K | 900 | 16.97 | 254.55 |
| Very Long | LVBench | 1.6 K | 103 | 68.35 | 117.33 |
| Very Long | InfiniBench (Ours) | 111.82 K | 1219 | 52.59 | 1068.45 |

Comparison between InfiniBench and existing video understanding benchmarks. InfiniBench has the largest number of QA pairs and the longest total video duration.



Benchmark statistics.

InfiniBench skill statistics. (A) Number of questions per skill, (B) number of videos per skill, and (C) average video duration per skill.

Comparison between TV shows and movies. (A) Number of questions, (B) number of videos, (C) total video duration, and (D) minimum, maximum, and average video duration for each video source.

Full annotation pipeline.

Full annotation pipeline for the InfiniBench skill set. The upper section depicts the global appearance pipeline, while the lower section illustrates question generation using GPT-4o. The gates for the video summary and video transcript indicate that some skills use only the summary, others use only the transcript, and some use both.
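To make the summary/transcript gating concrete, the following is a minimal sketch of per-skill question generation with GPT-4o, where each skill declares which text sources it consumes. The skill-to-source mapping, prompt wording, and client usage are illustrative assumptions rather than the authors' actual pipeline or prompts.

```python
# Hedged sketch of the summary/transcript gating described above: each skill
# declares which text sources it needs before GPT-4o generates questions.
# The skill-to-source mapping and prompt text are illustrative, not official.
from openai import OpenAI

client = OpenAI()

# Which inputs each skill consumes (illustrative assumption).
SKILL_SOURCES = {
    "summarization": {"summary"},
    "deep_context_understanding": {"summary", "transcript"},
    "spoiler_understanding": {"summary"},
    "linking_events": {"summary", "transcript"},
}

def generate_questions(skill, summary, transcript):
    parts = [f"Skill: {skill}. Generate question-answer pairs for this episode."]
    sources = SKILL_SOURCES[skill]
    if "summary" in sources:
        parts.append(f"Episode summary:\n{summary}")
    if "transcript" in sources:
        parts.append(f"Episode transcript:\n{transcript}")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "\n\n".join(parts)}],
    )
    return response.choices[0].message.content
```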

Some qualitative results


Example question from the global appearance skill and how the models perform on it.

Example question from the scene transitions skill and how the models perform on it.

Linking events failure and success cases

Example question from the spoiler understanding skill and how the models perform on it; S: number denotes the GPT-4o score.

Example question from the deep context understanding skill and how the models perform on it; S: number denotes the GPT-4o score.

Failure and success cases while generating the benchmark

Linking events failure and success cases

Linking multiple events failure and success cases

Temporal order of events failure and success cases


Examples

Linking multiple events questions example
Spoiler questions example
Temporal order of events questions example
Deep context understanding questions example
Sequence of character actions questions example
Summarization questions example