The InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side provides example multiple-choice (MCQ) and open-ended questions.
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. We introduce InfiniBench, a comprehensive benchmark designed to push the limits of extremely long video understanding. InfiniBench offers: 1) the longest total video duration, exceeding 1,000 hours, with an average of 52.59 minutes per video; 2) the largest number of question-answer pairs, totaling 111.82 K; 3) grounding and reasoning questions that require MVLMs to retrieve, structure, and interpret complex video content while establishing causal relationships; and 4) diverse question types spanning eight distinct skills and covering both multiple-choice and open-ended formats. We comprehensively evaluate state-of-the-art large multi-modality models on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as recent open-source models. The results highlight significant challenges, with even the leading models struggling to perform well: GPT-4o and Gemini 1.5 Flash attain average accuracies of just 49.34\% and 41.99\%, and average scores of 3.25 and 2.79 out of 5, respectively. Qwen2.5VL is the strongest open-source contender, approaching the performance of Gemini 1.5 Flash.
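The two headline metrics in the leaderboard below can be reproduced with a simple aggregation: accuracy averaged over the multiple-choice skills and a 0-5 score averaged over the open-ended skills. The sketch below is ours, not the InfiniBench evaluation code; record fields are hypothetical, and we assume the headline numbers average per-skill means (question-level averaging would be an equally plausible convention).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question records (field names are ours, not InfiniBench's).
# MCQ skills carry a boolean "correct"; open-ended skills carry a 0-5 "score".
results = [
    {"skill": "global_appearance", "type": "mcq", "correct": True},
    {"skill": "scene_transitions", "type": "mcq", "correct": False},
    {"skill": "summarization", "type": "open", "score": 3.0},
    # ... one record per evaluated question ...
]

def aggregate(results):
    """Per-skill means, plus Avg. Acc. (%) over MCQ skills and Avg. Score (/5) over open-ended skills."""
    per_skill, skill_type = defaultdict(list), {}
    for r in results:
        skill_type[r["skill"]] = r["type"]
        per_skill[r["skill"]].append(100.0 * r["correct"] if r["type"] == "mcq" else r["score"])

    skill_means = {skill: mean(values) for skill, values in per_skill.items()}
    avg_acc = mean(m for s, m in skill_means.items() if skill_type[s] == "mcq")
    avg_score = mean(m for s, m in skill_means.items() if skill_type[s] == "open")
    return skill_means, avg_acc, avg_score
```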
| Models | Frame Rate | Global Appearance | Scene Transitions | Character Actions | Chronological Understanding | Summarization | Deep Context Understanding | Spoiler Understanding | Linking Events | Avg. Acc. | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline Random | -- | 16.68 | 16.66 | 16.14 | 41.51 | -- | -- | -- | -- | 22.20 | -- |
| GPT-4o | 250 FPV | 44.51 | 47.93 | 36.07 | 68.85 | 3.49 | 3.39 | 2.67 | 3.45 | 49.34 | 3.25 |
| Gemini-1.5-Flash | -- | 42.10 | 31.63 | 37.82 | 56.41 | 3.24 | 2.55 | 2.05 | 3.33 | 41.99 | 2.79 |
| Qwen2.5VL | 250 FPV | 34.99 | 36.45 | 35.09 | 51.57 | 1.26 | 2.35 | 1.73 | 3.15 | 39.53 | 2.12 |
| Qwen2VL | 250 FPV | 29.99 | 37.54 | 36.86 | 50.85 | 0.67 | 2.07 | 1.41 | 2.76 | 38.81 | 1.73 |
| LongVU | 250 FPV | 38.46 | 22.69 | 28.97 | 45.13 | 0.20 | 1.10 | 0.71 | 1.37 | 33.81 | 0.84 |
| LLaVA-OneVision | 128 FPV | 33.00 | 25.02 | 24.83 | 45.91 | 0.49 | 1.78 | 1.30 | 2.51 | 32.19 | 1.52 |
| InternLM-XComposer-2.5-OL | 16 FPW | 27.17 | 24.37 | 30.09 | 46.68 | 0.37 | 1.21 | 0.61 | 2.03 | 32.08 | 1.06 |
| InternVL2.5 | 128 FPV | 29.84 | 25.35 | 26.41 | 45.58 | 0.65 | 1.48 | 1.06 | 2.22 | 31.80 | 1.35 |
| InternVL2 | 128 FPV | 24.60 | 21.98 | 25.00 | 44.63 | 0.69 | 1.68 | 1.25 | 2.47 | 29.05 | 1.52 |
| LLaMA-VID | 1 FPS | 17.37 | 17.06 | 18.25 | 41.74 | 1.58 | 2.00 | 1.49 | 2.40 | 23.61 | 1.87 |
| Goldfish | 45 FPW | 10.30 | 2.82 | 20.87 | 40.14 | 0.77 | 2.36 | 1.85 | 3.01 | 18.53 | 2.00 |
| MiniGPT4-video | 45 FPV | 2.33 | 1.09 | 2.36 | 39.86 | 0.05 | 0.54 | 0.75 | 0.89 | 11.41 | 0.56 |
InfiniBench leaderboard across the eight skills. The first four skill columns (grounding skills) are multiple-choice and reported as accuracy (%); the last four (reasoning skills) are open-ended and reported as scores out of 5. FPV (frames per video), FPS (frames per second), and FPW (frames per window) indicate each model's frame sampling setting. All models in this evaluation use subtitles.
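The Frame Rate column reflects how many frames each model ingests: a fixed per-video budget (FPV), a fixed rate (FPS), or a per-window budget (FPW). The sketch below illustrates uniform index selection under an FPV budget; the helper name and signature are ours and only approximate how any particular model samples frames.

```python
import numpy as np

def sample_frame_indices(total_frames: int, budget: int) -> np.ndarray:
    """Pick `budget` evenly spaced frame indices over the whole video (FPV-style).

    FPS-style sampling would instead set budget = duration_seconds * target_fps,
    and FPW-style models apply the same idea per temporal window rather than per video.
    """
    if total_frames <= budget:
        return np.arange(total_frames)
    # Evenly spaced indices from the first to the last decoded frame.
    return np.linspace(0, total_frames - 1, num=budget).round().astype(int)

# Example: a 45-minute video decoded at 24 fps, sampled with a 250 FPV budget.
indices = sample_frame_indices(total_frames=45 * 60 * 24, budget=250)
```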
| Category | Benchmark | Number of Questions | Number of Videos | Avg. Video Duration (min) | Total Video Duration (hours) | MCQ | Open-Ended | QA Source: Video | QA Source: Transcript | QA Source: Summary | Auto Annotation | Human Annotation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Short | TGIF-QA | 8.5 K | 9575 | 0.05 | 7.98 | ✘ | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ |
| Short | MSRVTT-QA | 72.8 K | 2990 | 0.25 | 12.45 | ✘ | ✔ | ✔ | ✘ | ✘ | ✔ | ✘ |
| Short | MV-Bench | 4.0 K | 3641 | 0.27 | 16.38 | ✔ | ✘ | ✔ | ✘ | ✘ | ✔ | ✘ |
| Long | Activity-QA | 8.0 K | 800 | 1.85 | 24.67 | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Long | TVQA | 15.2 K | 2179 | 1.86 | 67.55 | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Long | Egoschema | 5.0 K | 5063 | 3.00 | 253.15 | ✔ | ✘ | ✔ | ✘ | ✘ | ✔ | ✔ |
| Long | LongVideoBench | 6.7 K | 3763 | 7.88 | 494.21 | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Long | Moviechat | 13.0 K | 1000 | 9.40 | 156.67 | ✘ | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Long | MLVU | 3.1 K | 1730 | 15.50 | 446.92 | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ | ✔ |
| Long | MoVQA | 21.9 K | 100 | 16.53 | 27.55 | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Long | Video-MME | 2.7 K | 900 | 16.97 | 254.55 | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Very Long | LVBench | 1.6 K | 103 | 68.35 | 117.33 | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Very Long | InfiniBench (Ours) | 111.82 K | 1219 | 52.59 | 1068.45 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Comparison between InfiniBench and existing video understanding benchmarks. InfiniBench has the largest number of QA pairs and the longest total video duration.
InfiniBench skill statistics: (A) number of questions per skill, (B) number of videos per skill, and (C) average video duration per skill.
Comparison between TV shows and movies: (A) number of questions, (B) number of videos, (C) total video duration, and (D) minimum, maximum, and average video duration for each video source.
Full annotation pipeline for the InfiniBench skill set. The upper section depicts the global appearance pipeline, while the lower section illustrates question generation with GPT-4o. The gates for the video summary and the video transcript indicate that some skills use only the summary, others only the transcript, and some both.
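The summary/transcript gates can be viewed as a per-skill routing table that decides which textual context accompanies the question-generation prompt. The sketch below is schematic only: the function, the dictionary, and the specific skill-to-source assignments are illustrative placeholders, not the benchmark's actual configuration.

```python
# Illustrative routing table: which textual sources feed GPT-4o question
# generation for a given skill. The entries shown are placeholders; the real
# per-skill assignments are defined by the InfiniBench pipeline itself.
SKILL_SOURCES = {
    "summarization": {"summary"},
    "character_actions": {"transcript"},
    "deep_context_understanding": {"summary", "transcript"},
    # ... one entry per skill ...
}

def build_generation_context(skill: str, summary: str, transcript: str) -> str:
    """Assemble the textual context passed to the question-generation model."""
    sources = SKILL_SOURCES[skill]
    parts = []
    if "summary" in sources:
        parts.append(f"Video summary:\n{summary}")
    if "transcript" in sources:
        parts.append(f"Video transcript:\n{transcript}")
    return "\n\n".join(parts)
```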