CVPR 2026 · Findings

ReFoCUS

Reinforcement-guided Frame Optimization for Contextual Understanding

“Frame selection via a learned policy, not a heuristic. The model wasn’t reasoning wrong. It just never saw the evidence.”

Hosu Lee¹∗, Junho Kim²∗, Hyunjun Kim¹, Yong Man Ro¹†

¹ KAIST  ·  ² UIUC  ·  ∗ Equal contribution  ·  † Corresponding author

+9.5 · Largest single gain · MLVU, Qwen3-VL 8B (63.0→72.5)
13 · Video-LLMs improved · from sub-1B open models to GPT-4o
5 · Video-QA benchmarks · consistent gains across all
70% · Less peak VRAM · 5.3 GB vs 17.7 (mDP³) · 19.1 (T∗)

One selector. Every backbone. Bigger wins.

ReFoCUS is a drop-in frame selector. Swap uniform sampling for its query-conditioned policy and accuracy jumps across the board — the model finally sees the evidence the question points to.


Top: performance gains across five benchmarks at every model scale — from sub-1B open models to proprietary GPT-4o and Gemini 2.5 Flash. Bottom: for the query “what is the spatula-holding hand doing?”, uniform sampling misses the decisive moment, while ReFoCUS selects the sparse evidence frames that actually answer the question before passing them to the Video-LLM.

Abstract

Aligning frame selection with model-internal utility

Recent progress in Large Multi-modal Models has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame-selection strategies. Prior works rely on static heuristics or external retrieval modules, but these often fail to capture visual cues grounded in the user’s query, conflating raw visual dynamics with true semantic relevance. We introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS learns a frame-selection policy using reward signals derived from reference models, capturing their internal scoring behavior over frame combinations that best support temporally grounded responses. To explore the large combinatorial frame space efficiently, it uses an autoregressive, query-conditional selection architecture that ensures contextual consistency while reducing complexity. The policy needs no explicit frame-level supervision — it implicitly discovers optimal, semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video-QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.
How it works

“Which frames?” reframed as a reinforcement-learning policy

Instead of optimizing the model’s output, ReFoCUS optimizes its input — learning which frames to feed by training a policy directly against the downstream model’s own confidence.


The ReFoCUS loop. A query-conditioned policy πθ (Vision Encoder + Mamba) selects frames one at a time. A frozen reference model rφ scores each candidate subset by a group-wise prediction margin, and the resulting advantage updates the policy — no frame-level labels required.

Query-conditioned autoregressive policy

Built on Video-Ma²mba (state-space backbone, linear complexity). Starting from a <sof> token, the policy πθ picks frames one at a time — each conditioned on the query and previously chosen frames — assembling a coherent, duplicate-free evidence set.
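The one-at-a-time, duplicate-free selection described above can be sketched roughly as follows. This is a minimal illustration, not the ReFoCUS policy: the dot-product scorer, the running `context` vector, and the function name `select_frames` are all assumptions standing in for the Vision Encoder + Mamba backbone.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_frames(frame_feats, query_feat, k, rng):
    """Sample k distinct frame indices autoregressively.

    Hypothetical stand-in scorer: every remaining frame is scored against a
    context vector built from the query and the frames chosen so far.
    Returns the chosen indices and the summed log-probability of the picks.
    """
    num_frames = len(frame_feats)
    if k > num_frames:
        raise ValueError("cannot select more frames than are available")
    dim = len(query_feat)
    chosen = []
    log_prob = 0.0
    context = list(query_feat)  # initial state, before any frame is chosen
    for _ in range(k):
        # Only frames not yet chosen are candidates, so no duplicates occur.
        candidates = [i for i in range(num_frames) if i not in chosen]
        logits = [dot(frame_feats[i], context) for i in candidates]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # stable softmax weights
        total = sum(weights)
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        log_prob += (logits[pick] - top) - math.log(total)
        chosen.append(candidates[pick])
        # Condition the next step on the query and the frames picked so far.
        context = [
            query_feat[d] + sum(frame_feats[i][d] for i in chosen) / len(chosen)
            for d in range(dim)
        ]
    return chosen, log_prob


rng = random.Random(0)
feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
query = [rng.gauss(0, 1) for _ in range(4)]
picked, lp = select_frames(feats, query, 8, rng)
print(sorted(picked), round(lp, 3))
```

The summed log-probability is the quantity a policy-gradient update would differentiate; here it is only tracked, since the toy scorer has no trainable parameters.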

Reward from a frozen reference model

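Each candidate subset is scored by a group-relative prediction margin from a frozen reference model rφ (InternVL3): r = tanh((z_{y*} − z)/2), where z_{y*} is the model’s score for the correct answer y* and z is its score for the strongest distractor. The reward therefore measures how confidently the reference model favors the correct answer over the strongest distractor. This is input-level alignment, not output-level.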
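As a rough sketch of the margin reward, assuming the reference model exposes one scalar score per answer option (treating those scores as logits is an assumption, and `margin_reward` is a hypothetical name):

```python
import math


def margin_reward(option_scores, correct_idx):
    """r = tanh((z_correct - z_strongest_distractor) / 2), in (-1, 1)."""
    if len(option_scores) < 2:
        raise ValueError("need the correct option and at least one distractor")
    z_correct = option_scores[correct_idx]
    # Strongest distractor: the highest-scoring option that is not correct.
    z_distractor = max(
        s for i, s in enumerate(option_scores) if i != correct_idx
    )
    return math.tanh((z_correct - z_distractor) / 2.0)


# Correct option leads by 1.5: positive reward.
print(margin_reward([2.0, 0.0, -1.0, 0.5], correct_idx=0))
# A distractor leads: negative reward.
print(margin_reward([0.0, 1.0, -1.0, 0.5], correct_idx=0))
```

Since tanh(x/2) = 2σ(x) − 1, this reward is a rescaled sigmoid of the margin: bounded, zero when the correct answer ties the best distractor, and saturating for large margins. The sketch covers only the per-subset margin, not the group-relative normalization.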

Frame-level policy optimization

Group-normalized advantages (GRPO) update the policy; an entropy bonus replaces the KL term to keep exploration alive. A search-space curriculum (4→8→16→32) stabilizes learning over a space as large as C(512, 32) ≈ 7×10⁵⁰.
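A minimal sketch of the two ingredients named above, plus a check of the search-space size. The function names, the `eps` stabilizer, and the `entropy_coef` value are illustrative assumptions, not the settings used for ReFoCUS.

```python
import math
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled frame subsets."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def policy_loss(log_probs, advantages, entropies, entropy_coef=0.01):
    """Policy-gradient surrogate with an entropy bonus (value to minimize).

    log_probs: summed log-probability of each sampled subset.
    entropies: mean per-step entropy of the policy for each subset.
    """
    n = len(log_probs)
    pg_term = -sum(a * lp for a, lp in zip(advantages, log_probs)) / n
    entropy_term = sum(entropies) / n
    return pg_term - entropy_coef * entropy_term


# Rewards for a group of four sampled subsets of the same video and query.
adv = group_normalized_advantages([0.2, 0.8, -0.4, 0.6])
print([round(a, 3) for a in adv])

# Number of ways to choose 32 frames out of 512 (order ignored).
print(f"{math.comb(512, 32):.1e}")
```

Subsets that beat their group average get a positive advantage and have their log-probability pushed up; the entropy term rewards the policy for staying spread out instead of collapsing early.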

Results

Plug in ReFoCUS → every model gets better

As a model-agnostic frame selector, ReFoCUS lifts 13 Video-LLMs across 5 benchmarks. Each cell shows the base score, the score with ReFoCUS, and the change (▲ gain). Green is a win, and almost everything is green.

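Video-MME is evaluated without subtitles (w/o sub). Each cell reads base → +ReFoCUS (▲ gain, ▼ drop).

| Model | Video-MME short | Video-MME medium | Video-MME long | Video-MME overall | LongVideoBench val | MLVU m-avg | Video-MMMU overall | NExT-QA val (wups) |
|---|---|---|---|---|---|---|---|---|
| Closed-source | | | | | | | | |
| Gemini 2.5 Flash | 77.6 → 79.3 (▲1.7) | 63.7 → 68.3 (▲4.6) | 56.8 → 60.8 (▲4.0) | 66.0 → 69.5 (▲3.5) | 47.9 → 50.9 (▲3.0) | 52.8 → 58.0 (▲5.2) | 40.6 → 45.6 (▲5.0) | 11.7 → 11.9 (▲0.2) |
| GPT-4o | 68.0 → 68.2 (▲0.2) | 55.0 → 60.1 (▲5.1) | 53.3 → 54.0 (▲0.7) | 58.8 → 60.8 (▲2.0) | 49.5 → 52.9 (▲3.4) | 58.7 → 65.1 (▲6.4) | 62.9 → 62.1 (▼0.8) | 8.5 → 9.1 (▲0.6) |
| Open-source | | | | | | | | |
| LLaVA-OneVision 0.5B | 53.7 → 58.3 (▲4.6) | 39.9 → 44.6 (▲4.7) | 37.0 → 38.3 (▲1.3) | 43.5 → 47.1 (▲3.6) | 44.7 → 48.7 (▲4.0) | 44.8 → 50.3 (▲5.5) | 17.3 → 19.4 (▲2.1) | 18.1 → 18.9 (▲0.8) |
| InternVL3 1B | 63.1 → 66.4 (▲3.3) | 46.9 → 51.8 (▲4.9) | 39.9 → 42.6 (▲2.7) | 50.0 → 53.6 (▲3.6) | 47.6 → 50.6 (▲3.0) | 54.0 → 58.9 (▲4.9) | 27.7 → 29.3 (▲1.6) | 20.0 → 20.4 (▲0.4) |
| VideoLLaMA 3 2B | 55.2 → 58.9 (▲3.7) | 38.8 → 44.1 (▲5.3) | 35.2 → 38.3 (▲3.1) | 43.1 → 47.1 (▲4.0) | 48.8 → 53.7 (▲4.9) | 46.8 → 50.2 (▲3.4) | 28.7 → 29.2 (▲0.5) | 18.9 → 20.7 (▲1.8) |
| InternVL3 2B | 71.0 → 72.2 (▲1.2) | 56.4 → 60.2 (▲3.8) | 47.8 → 49.7 (▲1.9) | 58.4 → 60.7 (▲2.3) | 50.9 → 54.9 (▲4.0) | 62.7 → 68.0 (▲5.3) | 38.3 → 39.3 (▲1.0) | 24.4 → 25.0 (▲0.6) |
| InternVL3.5 4B | 76.4 → 78.0 (▲1.6) | 60.3 → 62.3 (▲2.0) | 51.3 → 57.4 (▲6.1) | 62.7 → 65.9 (▲3.2) | 57.7 → 62.6 (▲4.9) | 66.6 → 71.5 (▲4.9) | 52.0 → 53.3 (▲1.3) | 22.1 → 22.9 (▲0.8) |
| Qwen3-VL 4B | 74.1 → 76.7 (▲2.6) | 61.0 → 65.7 (▲4.7) | 51.3 → 57.0 (▲5.7) | 62.1 → 66.4 (▲4.3) | 57.4 → 61.9 (▲4.5) | 63.1 → 71.9 (▲8.8) | 54.0 → 56.4 (▲2.4) | 23.8 → 24.1 (▲0.3) |
| VideoLLaMA 3 7B | 70.4 → 72.2 (▲1.8) | 57.7 → 60.1 (▲2.4) | 48.9 → 54.3 (▲5.4) | 59.0 → 62.2 (▲3.2) | 54.8 → 57.0 (▲2.2) | 52.9 → 59.8 (▲6.9) | 32.8 → 34.4 (▲1.6) | 25.8 → 26.5 (▲0.7) |
| LLaVA-OneVision 7B | 70.9 → 72.8 (▲1.9) | 55.7 → 61.7 (▲6.0) | 48.8 → 53.4 (▲4.6) | 58.4 → 62.6 (▲4.2) | 55.0 → 61.0 (▲6.0) | 63.7 → 68.5 (▲4.8) | 34.1 → 35.7 (▲1.6) | 16.2 → 16.4 (▲0.2) |
| InternVL3 8B | 75.1 → 75.8 (▲0.7) | 64.4 → 66.8 (▲2.4) | 53.4 → 58.3 (▲4.9) | 64.3 → 67.0 (▲2.7) | 57.8 → 62.0 (▲4.2) | 68.1 → 72.7 (▲4.6) | 49.3 → 50.6 (▲1.3) | 26.6 → 26.8 (▲0.2) |
| InternVL3.5 8B | 77.4 → 76.2 (▼1.2) | 62.4 → 64.9 (▲2.5) | 53.2 → 58.9 (▲5.7) | 64.4 → 66.7 (▲2.3) | 59.7 → 64.1 (▲4.4) | 67.3 → 70.6 (▲3.3) | 50.0 → 53.2 (▲3.2) | 24.3 → 24.7 (▲0.4) |
| Qwen3-VL 8B | 75.1 → 79.6 (▲4.5) | 64.6 → 67.0 (▲2.4) | 55.3 → 58.9 (▲3.6) | 65.0 → 68.5 (▲3.5) | 56.6 → 63.3 (▲6.7) | 63.0 → 72.5 (▲9.5) | 59.1 → 61.1 (▲2.0) | 25.3 → 25.7 (▲0.4) |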

The top line at every frame budget

Against seven competitive frame selectors — Uniform, Frame-Voyager, BOLT, mDP³, TSPO, A.I.R., and K-frames — ReFoCUS is the highest curve at every budget from 4 to 64 frames, on both backbones.

ReFoCUS = top line everywhere

Fig. 4 — Video-MME vs. frame budget. ReFoCUS (orange) sits above Uniform, Frame-Voyager, BOLT, mDP³, TSPO, A.I.R. and K-frames at every budget, on both LLaVA-OV-7B and Qwen2.5-VL-7B. More frames help everyone — ReFoCUS helps most.

Near-perfect recall

Fig. 7 — Visual Needle-in-a-Haystack. Uniform sampling (left) turns red — it fails to retrieve the needle. ReFoCUS (right) stays green — near-perfect recall across every needle position and 64–1k frames.

Qualitative evidence

It samples the frames the question points to

Four cases where uniform sampling answers wrong, and only ReFoCUS gets it right — because it concentrates its frame budget on the moments that hold the answer.

Legend: frames are marked either as showing the answer evidence or as possibly relevant.

Video-MME · temporal ordering

In which order are the first four cars driven out of the garage?

(a) yellow · (b) black · (c) silver · (d) white

A. (a)(b)(c)(d). B. (a)(c)(b)(d). C. (b)(c)(d)(a). D. (b)(d)(a)(c).
Uniform: C. (b)(c)(d)(a) (wrong)
32 frames, evenly spaced from 0:00 to 11:04 (about one every 21 s).

ReFoCUS: B. (a)(c)(b)(d) (correct)
32 frames, clustered on the opening garage sequence of the 11:04 video:
Frames 1–21 (all show the answer evidence): 0:01, 0:03, 0:06, 0:07, 0:09, 0:10, 0:11, 0:14, 0:15, 0:16, 0:18, 0:19, 0:20, 0:22, 0:23, 0:24, 0:26, 0:27, 0:28, 0:29, 0:36
Frames 22–32: 0:55, 4:43, 5:48, 7:00, 7:13, 7:17, 7:43, 8:13, 8:20, 8:22, 8:23

The cars all leave in the first 30 seconds. ReFoCUS packs ~20 frames into the opening garage sequence and reads the order yellow → silver → black → white, while uniform sampling spreads evenly across the 11-minute clip and shuffles the sequence.
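To make the contrast concrete, the sketch below recomputes the uniform grid for this clip and counts how many frames each selector spends on the first 30 seconds. It assumes the clip lasts exactly 11:04 (664 s) and that timestamps are floored to whole seconds; the ReFoCUS timestamps are copied from the frame strip for this case, and the helper names are illustrative.

```python
def uniform_timestamps(duration_s, num_frames):
    """Evenly spaced timestamps (seconds) from 0 to duration_s inclusive."""
    if num_frames < 2:
        raise ValueError("need at least two frames to span the clip")
    return [duration_s * i / (num_frames - 1) for i in range(num_frames)]


def fmt(t):
    """Format seconds as m:ss, flooring to whole seconds."""
    s = int(t)
    return f"{s // 60}:{s % 60:02d}"


def frames_within(timestamps, start_s, end_s):
    """Count timestamps falling inside [start_s, end_s]."""
    return sum(start_s <= t <= end_s for t in timestamps)


uniform = uniform_timestamps(664, 32)

# ReFoCUS picks for this case, converted to seconds.
refocus = [1, 3, 6, 7, 9, 10, 11, 14, 15, 16, 18, 19, 20, 22, 23, 24,
           26, 27, 28, 29, 36, 55, 283, 348, 420, 433, 437, 463, 493,
           500, 502, 503]

print([fmt(t) for t in uniform[:4]])   # ['0:00', '0:21', '0:42', '1:04']
print(frames_within(uniform, 0, 30))   # 2
print(frames_within(refocus, 0, 30))   # 20
```

Under these assumptions, uniform sampling places 2 of its 32 frames in the window where the cars leave, while the ReFoCUS selection places 20 there.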
Video-MME · temporal grounding

What do the expanding red lines on the map in the first few minutes of the video stand for?

A. The Yellow River. B. The Silk Road. C. Du Fu’s route to Xi’an. D. The Yangtze River.
Uniform: C. Du Fu’s route (wrong)
32 frames, evenly spaced from 0:00 to 59:00 (about one every 1:54).

ReFoCUS: B. The Silk Road (correct)
32 frames, concentrated on the early map segment of the 59:00 video:
Frames 1–17: 0:00, 0:13, 0:20, 0:27, 0:34, 0:41, 0:48, 1:09, 1:16, 1:36, 1:43, 1:50, 1:57, 2:04, 2:11, 2:18, 2:25
Frames 18–19 (show the answer evidence): 3:00, 3:07
Frames 20–30: 3:13, 3:48, 4:37, 4:44, 4:50, 4:57, 5:04, 5:11, 7:23, 8:11, 10:23
Frames 31–32 (possibly relevant): 12:07, 31:38

The evidence is in the first few minutes. ReFoCUS packs its budget into the opening map sequence and reads the expanding red lines as the Silk Road, while uniform sampling spreads frames across the hour-long video and never looks closely.
Video-MME · counting

What is the total number of people in the video?

A. 7. B. 6. C. 5. D. 8.
Uniform: B. 6 (wrong)
32 frames, evenly spaced from 0:00 to 1:12 (about one every 2 s). Frames 22–26 (0:49, 0:51, 0:53, 0:56, 0:58) show the answer evidence.

ReFoCUS: A. 7 (correct)
32 frames, locked onto the full-group shot (~0:47–0:58) of the 1:12 video:
Frame 1: 0:22
Frames 2–32 (all show the answer evidence): 0:47, 0:48 (×4), 0:49 (×4), 0:50 (×4), 0:51 (×2), 0:52 (×2), 0:53 (×2), 0:54 (×3), 0:55 (×2), 0:56 (×2), 0:57 (×2), 0:58 (×3)

Counting needs the right shot, not many shots. ReFoCUS zooms into the window where everyone is on screen together and counts seven; the uniformly-spread baseline never catches the full group.
Video-MME · reading on-screen text

The video shows how long it takes to drive from the Earth to the Moon?

A. 160 days. B. 50 days. C. 180 days. D. 19 days.
Uniform: B. 50 days (wrong)
32 frames, evenly spaced from 0:00 to 11:09 (about one every 22 s).

ReFoCUS: A. 160 days (correct)
32 frames, focused where the figure is shown on screen in the 11:09 video:
Frames 1–7: 0:36, 0:44, 0:45, 0:47, 0:51, 0:52, 0:53
Frames 8–9 (show the answer evidence): 0:54, 0:56
Frames 10–32: 1:08, 1:09, 1:18, 1:21, 1:23, 1:34, 1:35, 1:36, 1:44, 1:46, 1:48, 1:49, 2:21, 2:31, 2:33, 2:34, 3:09, 4:48, 4:50, 4:51, 4:53, 4:54, 5:58

The answer is a number on the screen. ReFoCUS samples the exact segment where the “160 days” figure appears; uniform sampling skims past it and guesses.
CVPR Findings poster

The whole story on one board

Full CVPR 2026 Findings poster for ReFoCUS, with Problem, Method, and Results columns.
Cite

BibTeX

@InProceedings{Lee_2026_CVPR,
    author    = {Lee, Hosu and Kim, Junho and Kim, Hyunjun and Ro, Yong Man},
    title     = {ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {8291-8302}
}