We propose ReasonX, a novel framework for MLLM-guided improvement of intrinsic decomposition models
via relative intrinsic judgments on RGB input images.
Main Contributions
Relative MLLM Judge
A point-pair MLLM judge trained on synthetic data to answer modality-specific questions
about depth, albedo, irradiance, and normals via relative comparisons rather than absolute labels.
Intrinsic-GRPO Framework
A GRPO-based training scheme that treats judge–intrinsic agreement as a reward signal,
enabling ground-truth-free refinement on diverse, unlabeled real images.
Improved Generalization
Improved generalization to in-the-wild scenes across different intrinsic modalities
(albedo, depth, normals, and irradiance) and multiple base models.
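The core reward idea above — agreement between the judge's relative answers and relations derived analytically from the model's own outputs — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the equality tolerance `tau`, and the {+1, -1, 0} label encoding are all assumptions for exposition.

```python
import numpy as np

def relation_label(values, p, q, tau=0.05):
    """Classify the relative relation between two sampled points in a
    predicted intrinsic map: +1 (p greater), -1 (q greater), 0 (roughly
    equal). `tau` is a hypothetical equality tolerance."""
    a, b = float(values[p]), float(values[q])
    if abs(a - b) <= tau * max(abs(a), abs(b), 1e-6):
        return 0
    return 1 if a > b else -1

def agreement_reward(pred_map, point_pairs, judge_answers, tau=0.05):
    """Fraction of point pairs where the relation derived from the
    model's output matches the MLLM judge's relative answer."""
    matches = [
        relation_label(pred_map, p, q, tau) == ans
        for (p, q), ans in zip(point_pairs, judge_answers)
    ]
    return float(np.mean(matches))
```

In this sketch the reward is simply the match rate over sampled pairs; any monotone transform of it would serve the same role as a ground-truth-free training signal.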
Abstract
Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals,
and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic
datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX,
a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative
intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on
unlabeled, in-the-wild images. Unlike prior RL methods for generative models, our framework aligns conditional
intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived
relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic
predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements,
including 9–25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise
of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
Method
Overview of our ReasonX framework. (a) We fine-tune an MLLM to judge relative intrinsic properties from RGB images
using sampled point pairs. (b) The frozen judge then provides rewards within a GRPO loop to refine an intrinsic
decomposition model π: for each RGB image, we generate a group of G = 8 samples, query the
judge across point pairs and modalities, and compute group-relative rewards to update π without
ground-truth intrinsics.
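The group-relative reward step of the GRPO loop can be sketched as below. This is a simplified illustration under stated assumptions: the reward values are made up, and `group_relative_advantages` shows only the standard GRPO normalization (reward minus group mean, divided by group std), not the full policy update.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sample's reward by the
    group mean and std, so the update depends only on within-group
    ranking rather than absolute reward scale."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One step for a single RGB image (G = 8 samples, as in the paper):
# 1. sample G intrinsic decompositions from the current policy pi
# 2. score each with the frozen judge's agreement reward
# 3. convert the scores to group-relative advantages
rewards = [0.55, 0.70, 0.40, 0.65, 0.80, 0.50, 0.60, 0.75]  # illustrative
adv = group_relative_advantages(rewards)
```

Because advantages are centered within each group, no ground-truth intrinsics or learned value function is needed — only the judge's relative scores.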
Quantitative Results
IIW albedo benchmark. WHDR at 10% and 20% thresholds.
ReasonX significantly improves both Marigold and PRISM, with PRISM-X achieving 25% WHDR
reduction and matching methods trained directly on IIW.
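For reference, the WHDR metric reported above can be sketched as follows. This follows the standard IIW formulation (weighted rate of disagreement with human relative-reflectance judgments at a threshold delta, here 10% or 20%), but the data layout and helper names are assumptions of this sketch, not the benchmark's API.

```python
import numpy as np

def whdr(albedo, judgements, delta=0.10):
    """Weighted Human Disagreement Rate on relative reflectance.
    `judgements` is a list of ((p, q), label, weight) where label is
    +1 if point p is darker, -1 if point q is darker, 0 if about equal;
    `delta` is the relative-difference threshold (10% or 20% here)."""
    err, total = 0.0, 0.0
    for (p, q), label, w in judgements:
        ratio = float(albedo[p]) / max(float(albedo[q]), 1e-6)
        if ratio < 1.0 / (1.0 + delta):
            pred = 1       # p darker
        elif ratio > 1.0 + delta:
            pred = -1      # q darker
        else:
            pred = 0       # roughly equal
        err += w * (pred != label)
        total += w
    return err / max(total, 1e-6)
```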
Zero-shot depth estimation.
PRISM-X achieves 45.8% AbsRel improvement over its base model on ETH3D,
reaching results comparable to SOTA despite training on a fraction of the data.
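The AbsRel figure above measures absolute relative depth error; a minimal sketch of the standard definition (mean of |prediction − ground truth| / ground truth over valid pixels, lower is better) is below. The masking convention is an assumption of this sketch.

```python
import numpy as np

def abs_rel(pred, gt, mask=None):
    """Absolute relative depth error over valid pixels. A 45.8%
    "improvement" means this error drops by that fraction relative
    to the base model's error."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if mask is None:
        mask = gt > 0  # assumed convention: zero depth marks invalid pixels
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))
```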
Qualitative Comparisons
Drag the sliders to compare base model outputs (left) with their ReasonX-enhanced variants (right).
Each scene shows its available intrinsic channels side by side.
Christmas Market (PRISM → PRISM-X): input RGB, albedo, irradiance, normals, depth.
Darth Vader (PRISM → PRISM-X): input RGB, albedo, irradiance, normals, depth.
Dome Interior (PRISM → PRISM-X): input RGB, albedo, irradiance, normals, depth.
London (Marigold → Marigold-X): input RGB, albedo, irradiance.
Bridge (Marigold → Marigold-X): input RGB, albedo, irradiance.
Marigold IID Lighting v1.1 estimates albedo and irradiance only.
Additional Results
Surface normals. Comparisons of PRISM-X with its base model PRISM and Marigold
Normals v1.1 on NYUv2 and DIODE samples.
Zero-shot depth. Comparisons of PRISM-X with PRISM and Marigold Depth v1.0 on
NYUv2, DIODE and ETH3D. PRISM-X performs significantly better on challenging images.
Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
arXiv preprint, 2025
BibTeX
@inproceedings{Dirik2025ReasonXMI,
  title     = {ReasonX: MLLM-Guided Intrinsic Image Decomposition},
  author    = {Dirik, Alara and Wang, Tuanfeng and Ceylan, Duygu and
               Zafeiriou, Stefanos and Fr{\"u}hst{\"u}ck, Anna},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  year      = {2026}
}