InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies & TV Shows

The InfiniBench skill set comprises eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.

Abstract

Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. We introduce InfiniBench, a comprehensive benchmark designed to push the limits of extremely long video understanding. InfiniBench offers: 1) the longest total video duration, exceeding 1,000 hours, with an average of 52.59 minutes per video; 2) the largest number of question-answer pairs, totaling 111.82 K; 3) grounding and reasoning questions that require models to retrieve, structure, and interpret complex video content while establishing causal relationships; and 4) diverse question types spanning eight distinct skills in both multiple-choice and open-ended formats. We comprehensively evaluate state-of-the-art large multi-modal models on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as recent open-source models. The results highlight significant challenges, with even the leading models struggling to achieve high performance: GPT-4o and Gemini 1.5 Flash attained average accuracies of just 49.34% and 41.99%, and average scores of 3.25 and 2.79 out of 5, respectively. Qwen2.5VL is the strongest open-source model, approaching Gemini 1.5 Flash in performance.
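As a rough illustration of how these two aggregate numbers can be computed, the sketch below scores multiple-choice questions by exact-match accuracy and averages a 0-5 judge score for open-ended questions. The result schema and field names are assumptions for illustration, not InfiniBench's official evaluation code.

```python
# Minimal sketch (not the official evaluation code) of the two aggregate metrics:
# accuracy for multiple-choice (grounding) questions and a mean 0-5 judge score
# for open-ended (reasoning) questions. The per-question fields ("type",
# "prediction", "answer", "judge_score") are illustrative assumptions.
from statistics import mean

def aggregate(results):
    """results: list of dicts, one per question."""
    mcq = [r for r in results if r["type"] == "mcq"]
    open_ended = [r for r in results if r["type"] == "open"]

    # Accuracy (%) over multiple-choice questions.
    avg_acc = 100.0 * mean(r["prediction"] == r["answer"] for r in mcq) if mcq else None

    # Mean judge score (0-5) over open-ended questions.
    avg_score = mean(r["judge_score"] for r in open_ended) if open_ended else None
    return avg_acc, avg_score

# Example usage with toy data:
toy = [
    {"type": "mcq", "prediction": "B", "answer": "B"},
    {"type": "mcq", "prediction": "A", "answer": "C"},
    {"type": "open", "judge_score": 3.5},
]
print(aggregate(toy))  # (50.0, 3.5)
```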

InfiniBench Leaderboard

| Model | Frame Rate | Global Appearance | Scene Transitions | Character Actions | Chronological Understanding | Summarization | Deep Context Understanding | Spoiler Understanding | Linking Events | Avg. Acc. | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (Random) | -- | 16.68 | 16.66 | 16.14 | 41.51 | -- | -- | -- | -- | 22.20 | -- |
| GPT-4o | 250 FPV | 44.51 | 47.93 | 36.07 | 68.85 | 3.49 | 3.39 | 2.67 | 3.45 | 49.34 | 3.25 |
| Gemini-1.5-Flash | -- | 42.10 | 31.63 | 37.82 | 56.41 | 3.24 | 2.55 | 2.05 | 3.33 | 41.99 | 2.79 |
| Qwen2.5VL | 250 FPV | 34.99 | 36.45 | 35.09 | 51.57 | 1.26 | 2.35 | 1.73 | 3.15 | 39.53 | 2.12 |
| Qwen2VL | 250 FPV | 29.99 | 37.54 | 36.86 | 50.85 | 0.67 | 2.07 | 1.41 | 2.76 | 38.81 | 1.73 |
| LongVU | 250 FPV | 38.46 | 22.69 | 28.97 | 45.13 | 0.20 | 1.10 | 0.71 | 1.37 | 33.81 | 0.84 |
| LLaVA-OneVision | 128 FPV | 33.00 | 25.02 | 24.83 | 45.91 | 0.49 | 1.78 | 1.30 | 2.51 | 32.19 | 1.52 |
| InternLM-XComposer-2.5-OL | 16 FPW | 27.17 | 24.37 | 30.09 | 46.68 | 0.37 | 1.21 | 0.61 | 2.03 | 32.08 | 1.06 |
| InternVL2.5 | 128 FPV | 29.84 | 25.35 | 26.41 | 45.58 | 0.65 | 1.48 | 1.06 | 2.22 | 31.80 | 1.35 |
| InternVL2 | 128 FPV | 24.60 | 21.98 | 25.00 | 44.63 | 0.69 | 1.68 | 1.25 | 2.47 | 29.05 | 1.52 |
| LLaMA-VID | 1 FPS | 17.37 | 17.06 | 18.25 | 41.74 | 1.58 | 2.00 | 1.49 | 2.40 | 23.61 | 1.87 |
| Goldfish | 45 FPW | 10.30 | 2.82 | 20.87 | 40.14 | 0.77 | 2.36 | 1.85 | 3.01 | 18.53 | 2.00 |
| MiniGPT4-video | 45 FPV | 2.33 | 1.09 | 2.36 | 39.86 | 0.05 | 0.54 | 0.75 | 0.89 | 11.41 | 0.56 |

InfiniBench leaderboard across the eight skills. The first four skills (Global Appearance, Scene Transitions, Character Actions, Chronological Understanding) are grounding skills evaluated as multiple-choice accuracy (%), while the last four (Summarization, Deep Context Understanding, Spoiler Understanding, Linking Events) are reasoning skills with open-ended answers scored out of 5 by GPT-4o. Frame rates are reported as FPV (frames per video), FPS (frames per second), or FPW (frames per window). All models in this evaluation utilize subtitles.
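The frame-rate column only states how many frames each model receives. As a point of reference, a fixed frames-per-video budget is typically met by sampling frames uniformly across the whole video, whereas an FPS setting samples in proportion to duration. The sketch below illustrates both modes with OpenCV; it is an assumption about typical preprocessing, not the exact sampling code used by any model in the table.

```python
# Hedged sketch: uniform frame sampling for a fixed frames-per-video (FPV)
# budget vs. a fixed frames-per-second (FPS) rate. Illustrative preprocessing
# only, not InfiniBench's actual pipeline.
import cv2

def sample_frames(video_path, fpv=None, fps=None):
    assert (fpv is None) != (fps is None), "specify exactly one of fpv or fps"
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0

    if fpv is not None:
        # FPV: spread a fixed number of frames evenly over the whole video.
        indices = [int(i * total / fpv) for i in range(min(fpv, total))]
    else:
        # FPS: take roughly `fps` frames for every second of footage.
        step = max(int(native_fps / fps), 1)
        indices = list(range(0, total, step))

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# e.g. 250 frames per video, as used by several models in the table above:
# frames = sample_frames("episode.mp4", fpv=250)
```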



InfiniBench vs. existing video understanding benchmarks.

| Category | Benchmark | # Questions | # Videos | Avg. Video Duration (mins) | Total Video Duration (hours) |
|---|---|---|---|---|---|
| Short | TGIF-QA | 8.5 K | 9575 | 0.05 | 7.98 |
| Short | MSRVTT-QA | 72.8 K | 2990 | 0.25 | 12.45 |
| Short | MV-Bench | 4.0 K | 3641 | 0.27 | 16.38 |
| Long | Activity-QA | 8.0 K | 800 | 1.85 | 24.67 |
| Long | TVQA | 15.2 K | 2179 | 1.86 | 67.55 |
| Long | EgoSchema | 5.0 K | 5063 | 3.00 | 253.15 |
| Long | LongVideoBench | 6.7 K | 3763 | 7.88 | 494.21 |
| Long | MovieChat | 13.0 K | 1000 | 9.40 | 156.67 |
| Long | MLVU | 3.1 K | 1730 | 15.50 | 446.92 |
| Long | MoVQA | 21.9 K | 100 | 16.53 | 27.55 |
| Long | Video-MME | 2.7 K | 900 | 16.97 | 254.55 |
| Very Long | LVBench | 1.6 K | 103 | 68.35 | 117.33 |
| Very Long | InfiniBench (Ours) | 111.82 K | 1219 | 52.59 | 1068.45 |

Comparison between InfiniBench and existing video understanding benchmarks. InfiniBench has the largest number of QA pairs and the longest total video duration.



Benchmark statistics.

InfiniBench skill statistics. (A) Number of questions per skill, (B) number of videos per skill, and (C) average video duration per skill.

Comparison between TV shows and movies. (A) Number of questions, (B) number of videos, (C) total video duration, and (D) minimum, maximum, and average video duration for each video source.

Full annotation pipeline.

Full annotation pipeline for the InfiniBench skill set. The upper section depicts the global appearance pipeline, while the lower section illustrates question generation using GPT-4o. The gates for the video summary and video transcript indicate that some skills use only the summary, others use only the transcript, and some use both.
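To make the summary/transcript gating concrete, the following is a minimal sketch of per-skill question generation with GPT-4o, where each skill declares which text sources it consumes. The skill-to-source mapping, prompt wording, and client usage are illustrative assumptions rather than the authors' actual pipeline or prompts.

```python
# Hedged sketch of the summary/transcript gating described above: each skill
# declares which text sources it needs before GPT-4o generates questions.
# The skill-to-source mapping and prompt text are illustrative, not official.
from openai import OpenAI

client = OpenAI()

# Which inputs each skill consumes (illustrative assumption).
SKILL_SOURCES = {
    "summarization": {"summary"},
    "deep_context_understanding": {"summary", "transcript"},
    "spoiler_understanding": {"summary"},
    "linking_events": {"summary", "transcript"},
}

def generate_questions(skill, summary, transcript):
    parts = [f"Skill: {skill}. Generate question-answer pairs for this episode."]
    sources = SKILL_SOURCES[skill]
    if "summary" in sources:
        parts.append(f"Episode summary:\n{summary}")
    if "transcript" in sources:
        parts.append(f"Episode transcript:\n{transcript}")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "\n\n".join(parts)}],
    )
    return response.choices[0].message.content
```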

Some qualitative results


Example question from the global appearance skill and how the models perform on it.

Example question from the scene transitions skill and how the models perform on it.

Linking events failure and success cases

Example question from the spoiler understanding skill and how the models perform on it; S: number denotes the GPT-4o score.

Example question from the deep context understanding skill and how the models perform on it; S: number denotes the GPT-4o score.

Failure and success cases while generating the benchmark

Linking events failure and success cases

Linking multiple events failure and success cases

Temporal order of events failure and success cases


Examples

Linking multiple events questions example
Spoiler questions example
Temporal order of events questions example
Deep context understanding questions example
Sequence of character actions questions example
Summarization questions example