Reinforcement-guided Frame Optimization for Contextual Understanding
“Frame selection via a learned policy — not a heuristic. The model wasn’t reasoning wrong. It just never saw the evidence.”
1 KAIST · 2 UIUC · ∗ Equal contribution · † Corresponding author
ReFoCUS is a drop-in frame selector. Swap uniform sampling for its query-conditioned policy and accuracy jumps across the board — the model finally sees the evidence the question points to.
Top: performance gains across five benchmarks at every model scale — from sub-1B open models to proprietary GPT-4o and Gemini 2.5 Flash. Bottom: for the query “what is the spatula-holding hand doing?”, uniform sampling misses the decisive moment, while ReFoCUS selects the sparse evidence frames that actually answer the question before passing them to the Video-LLM.
Instead of optimizing the model’s output, ReFoCUS optimizes its input — learning which frames to feed by training a policy directly against the downstream model’s own confidence.
The ReFoCUS loop. A query-conditioned policy $\pi_\theta$ (Vision Encoder + Mamba) selects frames one at a time. A frozen reference model $r_\phi$ scores each candidate subset by a group-wise prediction margin, and the resulting advantage updates the policy; no frame-level labels are required.
Built on Video-Ma²mba (state-space backbone, linear complexity). Starting from a <sof> token, the policy $\pi_\theta$ picks frames one at a time, each conditioned on the query and the previously chosen frames, assembling a coherent, duplicate-free evidence set.
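The one-at-a-time, duplicate-free selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_scores` is a hypothetical stand-in for the policy's per-frame logits (the real policy is a Vision Encoder + Mamba network), and `SOF`, `select_frames`, and `query_weights` are names invented for this sketch.

```python
import math
import random

SOF = -1  # hypothetical placeholder index standing in for the <sof> start token


def toy_scores(query_weights, chosen):
    """Stand-in for the policy's per-frame logits (NOT the real network).

    A small penalty near already-chosen frames makes the logits depend on
    the selection history, as a history-conditioned policy's would.
    """
    logits = []
    for i, w in enumerate(query_weights):
        penalty = sum(1.0 for c in chosen if c != SOF and abs(c - i) <= 1)
        logits.append(w - 0.5 * penalty)
    return logits


def select_frames(query_weights, k, rng):
    """Sample k distinct frame indices, one per step, starting from <sof>."""
    chosen = [SOF]
    for _ in range(k):
        logits = toy_scores(query_weights, chosen)
        # Exclude frames that were already picked, so duplicates are impossible.
        taken = set(chosen)
        candidates = [i for i in range(len(logits)) if i not in taken]
        # Softmax over the remaining candidates (shifted by the max for stability).
        m = max(logits[i] for i in candidates)
        weights = [math.exp(logits[i] - m) for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(pick)
    return chosen[1:]  # drop <sof>; indices are in selection order


frames = select_frames([0.1 * i for i in range(16)], k=4, rng=random.Random(0))
```

Masking the already-chosen indices before sampling is what guarantees a duplicate-free set in this sketch; how the actual model enforces it is not stated here.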
Each candidate subset is scored by a group-relative prediction margin from a frozen $r_\phi$ (InternVL3): $r = \tanh\big((z_{y^*} - z_{\tilde{y}})/2\big)$, which measures how confidently it favors the correct answer $y^*$ over the strongest distractor $\tilde{y}$. Input-level alignment, not output-level.
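A minimal sketch of the margin reward, assuming the reference model exposes one scalar score per answer option (treated here as logits, which is an assumption). `margin_reward` and the example scores are hypothetical and not taken from the paper's code.

```python
import math


def margin_reward(scores, correct):
    """r = tanh((z_correct - z_strongest_distractor) / 2).

    `scores` maps each answer option to the reference model's score for it
    (assumed to be a logit); `correct` is the key of the ground-truth option.
    """
    z_correct = scores[correct]
    # The strongest distractor is the highest-scoring wrong option.
    z_distractor = max(z for option, z in scores.items() if option != correct)
    return math.tanh((z_correct - z_distractor) / 2.0)


example = {"a": 2.0, "b": 0.0, "c": -1.0, "d": 0.5}
r_right = margin_reward(example, "a")  # positive: the correct option leads
r_wrong = margin_reward(example, "c")  # negative: a distractor leads
```

Since tanh(x/2) = 2σ(x) − 1, the reward is a sigmoid of the margin rescaled to (−1, 1): it is positive when the correct answer outscores every distractor and saturates for large margins.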
Group-normalized advantages (GRPO) update the policy; an entropy bonus replaces KL to keep exploration alive. A search-space curriculum (4→8→16→32) stabilizes learning over a space as large as $C(512, 32) \approx 7 \times 10^{50}$.
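The update and the size of the search space can be illustrated as follows. This is a simplified sketch, not the authors' training code: the surrogate omits GRPO's ratio clipping, `beta` is a hypothetical entropy weight, and reading the 4→8→16→32 schedule as the number of frames selected from 512 candidates is an assumption based on the C(512, 32) figure.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidate subsets."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def entropy(probs):
    """Shannon entropy (in nats) of one selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surrogate_loss(log_probs, advantages, entropies, beta=0.01):
    """Policy-gradient surrogate minus an entropy bonus (no KL term)."""
    n = len(log_probs)
    pg = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    return pg - beta * sum(entropies) / n


def search_space_sizes(num_candidates=512, schedule=(4, 8, 16, 32)):
    """Number of distinct k-frame subsets at each curriculum stage."""
    return {k: math.comb(num_candidates, k) for k in schedule}


adv = group_advantages([0.6, -0.2, 0.1, 0.9])
sizes = search_space_sizes()
```

Under that reading, each curriculum stage enlarges the space the policy has to explore, and the last stage reaches the C(512, 32) figure quoted above.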
As a model-agnostic frame selector, ReFoCUS lifts 13 Video-LLMs across 5 benchmarks. Each cell shows base → +ReFoCUS followed by the absolute change. All but two of the 104 cells are gains; the exceptions are GPT-4o on Video-MMMU and InternVL3.5 8B on Video-MME short.
| Model | Video-MME (w/o sub) short | Video-MME (w/o sub) medium | Video-MME (w/o sub) long | Video-MME (w/o sub) overall | LongVideoBench val | MLVU m-avg | Video-MMMU overall | NExT-QA val (WUPS) |
|---|---|---|---|---|---|---|---|---|
| Closed-source | | | | | | | | |
| Gemini 2.5 Flash | 77.6 → 79.3 (▲1.7) | 63.7 → 68.3 (▲4.6) | 56.8 → 60.8 (▲4.0) | 66.0 → 69.5 (▲3.5) | 47.9 → 50.9 (▲3.0) | 52.8 → 58.0 (▲5.2) | 40.6 → 45.6 (▲5.0) | 11.7 → 11.9 (▲0.2) |
| GPT-4o | 68.0 → 68.2 (▲0.2) | 55.0 → 60.1 (▲5.1) | 53.3 → 54.0 (▲0.7) | 58.8 → 60.8 (▲2.0) | 49.5 → 52.9 (▲3.4) | 58.7 → 65.1 (▲6.4) | 62.9 → 62.1 (▼0.8) | 8.5 → 9.1 (▲0.6) |
| Open-source | | | | | | | | |
| LLaVA-OneVision 0.5B | 53.7 → 58.3 (▲4.6) | 39.9 → 44.6 (▲4.7) | 37.0 → 38.3 (▲1.3) | 43.5 → 47.1 (▲3.6) | 44.7 → 48.7 (▲4.0) | 44.8 → 50.3 (▲5.5) | 17.3 → 19.4 (▲2.1) | 18.1 → 18.9 (▲0.8) |
| InternVL3 1B | 63.1 → 66.4 (▲3.3) | 46.9 → 51.8 (▲4.9) | 39.9 → 42.6 (▲2.7) | 50.0 → 53.6 (▲3.6) | 47.6 → 50.6 (▲3.0) | 54.0 → 58.9 (▲4.9) | 27.7 → 29.3 (▲1.6) | 20.0 → 20.4 (▲0.4) |
| VideoLLaMA 3 2B | 55.2 → 58.9 (▲3.7) | 38.8 → 44.1 (▲5.3) | 35.2 → 38.3 (▲3.1) | 43.1 → 47.1 (▲4.0) | 48.8 → 53.7 (▲4.9) | 46.8 → 50.2 (▲3.4) | 28.7 → 29.2 (▲0.5) | 18.9 → 20.7 (▲1.8) |
| InternVL3 2B | 71.0 → 72.2 (▲1.2) | 56.4 → 60.2 (▲3.8) | 47.8 → 49.7 (▲1.9) | 58.4 → 60.7 (▲2.3) | 50.9 → 54.9 (▲4.0) | 62.7 → 68.0 (▲5.3) | 38.3 → 39.3 (▲1.0) | 24.4 → 25.0 (▲0.6) |
| InternVL3.5 4B | 76.4 → 78.0 (▲1.6) | 60.3 → 62.3 (▲2.0) | 51.3 → 57.4 (▲6.1) | 62.7 → 65.9 (▲3.2) | 57.7 → 62.6 (▲4.9) | 66.6 → 71.5 (▲4.9) | 52.0 → 53.3 (▲1.3) | 22.1 → 22.9 (▲0.8) |
| Qwen3-VL 4B | 74.1 → 76.7 (▲2.6) | 61.0 → 65.7 (▲4.7) | 51.3 → 57.0 (▲5.7) | 62.1 → 66.4 (▲4.3) | 57.4 → 61.9 (▲4.5) | 63.1 → 71.9 (▲8.8) | 54.0 → 56.4 (▲2.4) | 23.8 → 24.1 (▲0.3) |
| VideoLLaMA 3 7B | 70.4 → 72.2 (▲1.8) | 57.7 → 60.1 (▲2.4) | 48.9 → 54.3 (▲5.4) | 59.0 → 62.2 (▲3.2) | 54.8 → 57.0 (▲2.2) | 52.9 → 59.8 (▲6.9) | 32.8 → 34.4 (▲1.6) | 25.8 → 26.5 (▲0.7) |
| LLaVA-OneVision 7B | 70.9 → 72.8 (▲1.9) | 55.7 → 61.7 (▲6.0) | 48.8 → 53.4 (▲4.6) | 58.4 → 62.6 (▲4.2) | 55.0 → 61.0 (▲6.0) | 63.7 → 68.5 (▲4.8) | 34.1 → 35.7 (▲1.6) | 16.2 → 16.4 (▲0.2) |
| InternVL3 8B | 75.1 → 75.8 (▲0.7) | 64.4 → 66.8 (▲2.4) | 53.4 → 58.3 (▲4.9) | 64.3 → 67.0 (▲2.7) | 57.8 → 62.0 (▲4.2) | 68.1 → 72.7 (▲4.6) | 49.3 → 50.6 (▲1.3) | 26.6 → 26.8 (▲0.2) |
| InternVL3.5 8B | 77.4 → 76.2 (▼1.2) | 62.4 → 64.9 (▲2.5) | 53.2 → 58.9 (▲5.7) | 64.4 → 66.7 (▲2.3) | 59.7 → 64.1 (▲4.4) | 67.3 → 70.6 (▲3.3) | 50.0 → 53.2 (▲3.2) | 24.3 → 24.7 (▲0.4) |
| Qwen3-VL 8B | 75.1 → 79.6 (▲4.5) | 64.6 → 67.0 (▲2.4) | 55.3 → 58.9 (▲3.6) | 65.0 → 68.5 (▲3.5) | 56.6 → 63.3 (▲6.7) | 63.0 → 72.5 (▲9.5) | 59.1 → 61.1 (▲2.0) | 25.3 → 25.7 (▲0.4) |
Against seven competitive frame selectors — Uniform, Frame-Voyager, BOLT, mDP³, TSPO, A.I.R., and K-frames — ReFoCUS is the highest curve at every budget from 4 to 64 frames, on both backbones.
Fig. 4 — Video-MME vs. frame budget. ReFoCUS (orange) sits above Uniform, Frame-Voyager, BOLT, mDP³, TSPO, A.I.R. and K-frames at every budget, on both LLaVA-OV-7B and Qwen2.5-VL-7B. More frames help everyone — ReFoCUS helps most.
Fig. 7 — Visual Needle-in-a-Haystack. Uniform sampling (left) turns red — it fails to retrieve the needle. ReFoCUS (right) stays green — near-perfect recall across every needle position and 64–1k frames.
Four cases where uniform sampling answers wrong, and only ReFoCUS gets it right — because it concentrates its frame budget on the moments that hold the answer.
Legend: frame shows the answer evidence · frame is possibly relevant. Example answer options: (a) yellow · (b) black · (c) silver · (d) white
@InProceedings{Lee_2026_CVPR,
    author    = {Lee, Hosu and Kim, Junho and Kim, Hyunjun and Ro, Yong Man},
    title     = {ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {8291-8302}
}