Overview

ReasonX overview
We propose ReasonX, a novel framework for MLLM-guided improvement of intrinsic decomposition models via relative intrinsic judgments on RGB input images.

Main Contributions

Relative MLLM Judge

A point-pair MLLM judge trained on synthetic data to answer modality-specific questions about depth, albedo, irradiance and normals via relative comparisons rather than absolute labels.
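To make the idea of relative supervision concrete, here is a minimal sketch of how a point pair could be turned into a comparison label from a synthetic ground-truth map. The function names, the three-way label set (`"A"`, `"B"`, `"equal"`) and the 5% tolerance are illustrative assumptions, not details taken from ReasonX.

```python
import random


def sample_point_pairs(height, width, num_pairs, seed=0):
    """Sample distinct pixel pairs ((row, col), (row, col)) uniformly at random."""
    if height * width < 2:
        raise ValueError("need at least two pixels to form a pair")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        a = (rng.randrange(height), rng.randrange(width))
        b = (rng.randrange(height), rng.randrange(width))
        if a != b:
            pairs.append((a, b))
    return pairs


def relative_label(value_a, value_b, tolerance=0.05):
    """Compare two scalar values (e.g. depth at two pixels).

    Returns "A" if A is larger, "B" if B is larger, and "equal" when the
    two differ by no more than `tolerance` relative to the larger magnitude.
    The tolerance value is an assumption made for this sketch.
    """
    larger = max(abs(value_a), abs(value_b))
    if larger == 0 or abs(value_a - value_b) <= tolerance * larger:
        return "equal"
    return "A" if value_a > value_b else "B"


# Toy 2x3 depth map (arbitrary units), used only for illustration.
depth = [[1.0, 1.0, 4.0],
         [1.0, 2.0, 4.0]]
pairs = sample_point_pairs(height=2, width=3, num_pairs=4)
labels = [relative_label(depth[r1][c1], depth[r2][c2])
          for (r1, c1), (r2, c2) in pairs]
```

A label of this kind answers "which point is farther?" without requiring the judge to regress an absolute depth value, which is the property the relative formulation relies on.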

Intrinsic-GRPO Framework

A GRPO-based training scheme that treats judge–intrinsic agreement as a reward signal, enabling ground-truth-free refinement on diverse, unlabeled real images.
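One simple way to read "judge–intrinsic agreement as a reward" is as the fraction of point pairs on which the judge's answer matches the relation implied by the model's own prediction. The sketch below assumes both sides have already been reduced to discrete labels; the function name and label vocabulary are hypothetical, and the actual reward used by ReasonX may weight pairs or modalities differently.

```python
def agreement_reward(judge_answers, predicted_relations):
    """Fraction of point pairs where the judge and the prediction agree.

    judge_answers:        labels returned by the judge from the RGB image.
    predicted_relations:  labels derived analytically from the model output
                          at the same point pairs.
    """
    if len(judge_answers) != len(predicted_relations):
        raise ValueError("both label lists must cover the same point pairs")
    if not judge_answers:
        return 0.0
    matches = sum(j == p for j, p in zip(judge_answers, predicted_relations))
    return matches / len(judge_answers)


# Illustrative labels for four point pairs; three of them agree.
reward = agreement_reward(["A", "B", "equal", "A"],
                          ["A", "B", "A", "A"])
```

Because the reward depends only on the judge and the prediction, no ground-truth intrinsic map is needed for the image being scored.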

Improved Generalization

Better generalization to in-the-wild scenes across all four intrinsic modalities (albedo, depth, normals, and irradiance) and multiple base models.

Abstract

Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9–25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.

Method

ReasonX pipeline

Overview of our ReasonX framework. (a) We fine-tune an MLLM to judge relative intrinsic properties from RGB images using sampled point pairs. (b) The frozen judge then provides rewards within a GRPO loop to refine an intrinsic decomposition model π: for each RGB image, we generate a group of G = 8 samples, query the judge across point pairs and modalities, and compute group-relative rewards to update π without ground-truth intrinsics.
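The group-relative step can be sketched as follows. This uses the normalization commonly associated with GRPO (subtract the group mean, divide by the group standard deviation); whether ReasonX uses exactly this form is an assumption, and the reward values are made up for illustration.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-sample rewards within one group of G samples.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings for the same input image get a positive signal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One RGB image, G = 8 sampled predictions, each scored by the judge.
rewards = [0.50, 0.75, 0.25, 1.00, 0.50, 0.75, 0.50, 0.75]
advantages = group_relative_advantages(rewards)
```

When all samples in a group receive the same reward, every advantage is zero and that image contributes no update, which is the expected behavior for a purely relative signal.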

Quantitative Results

IIW albedo benchmark
IIW albedo benchmark. WHDR at 10% and 20% thresholds. ReasonX significantly improves both Marigold and PRISM, with PRISM-X achieving a 25% WHDR reduction and matching methods trained directly on IIW.
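For readers unfamiliar with the metric, WHDR (weighted human disagreement rate) measures how often a predicted albedo disagrees with human "which point is darker?" judgments, weighted by annotator confidence. The sketch below follows the usual IIW definition as we understand it; the data layout (a dictionary of per-point reflectance values and tuples of judgments) is a simplification chosen for this example.

```python
def whdr(judgments, reflectance, delta=0.10):
    """Weighted human disagreement rate for one image.

    judgments:   iterable of (point1, point2, human_label, weight), where
                 human_label is "1" (point 1 darker), "2" (point 2 darker)
                 or "E" (about equal).
    reflectance: mapping from point id to predicted albedo intensity.
    delta:       relative threshold (0.10 and 0.20 in the table above).
    """
    error_sum = 0.0
    weight_sum = 0.0
    for p1, p2, human_label, weight in judgments:
        r1 = max(reflectance[p1], 1e-10)  # guard against division by zero
        r2 = max(reflectance[p2], 1e-10)
        if r2 / r1 > 1.0 + delta:
            predicted = "1"
        elif r1 / r2 > 1.0 + delta:
            predicted = "2"
        else:
            predicted = "E"
        if predicted != human_label:
            error_sum += weight
        weight_sum += weight
    return error_sum / weight_sum if weight_sum > 0 else float("nan")


# Toy example: three judgments, the last (weight 2) is contradicted.
albedo = {"a": 0.20, "b": 0.50, "c": 0.52}
human = [("a", "b", "1", 1.0), ("b", "c", "E", 1.0), ("a", "c", "2", 2.0)]
score = whdr(human, albedo, delta=0.10)
```

Lower is better: a score of 0 means the predicted albedo ordering agrees with every human judgment.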
Depth estimation on NYUv2 and ETH3D
Zero-shot depth estimation. PRISM-X achieves a 45.8% AbsRel improvement over its base model on ETH3D, reaching results comparable to the state of the art despite training on a fraction of the data.
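As a reminder of what the reported number means, AbsRel is the mean absolute relative error between predicted and ground-truth depth, and the improvement is the relative drop in that error from the base model to the refined one. The sketch below is a simplified illustration: it omits the scale-and-shift alignment that affine-invariant depth models typically require before evaluation, and the numbers are arbitrary rather than taken from the benchmark.

```python
def abs_rel(predicted, ground_truth):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    errors = [abs(p - g) / g for p, g in zip(predicted, ground_truth) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)


def relative_improvement(base_error, new_error):
    """Fractional error reduction relative to the base model (0.25 means 25%)."""
    return (base_error - new_error) / base_error


# Arbitrary toy values: two valid pixels and one invalid (gt = 0) pixel.
error = abs_rel([1.1, 1.8, 3.0], [1.0, 2.0, 0.0])
gain = relative_improvement(base_error=0.20, new_error=0.15)
```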

Qualitative Comparisons

The comparisons below pair base model outputs with their ReasonX-enhanced variants. PRISM scenes cover all four intrinsic channels (albedo, irradiance, normals, and depth); Marigold scenes cover albedo and irradiance.

Christmas Market (PRISM → PRISM-X): input RGB; albedo, irradiance, normals, depth.
Darth Vader (PRISM → PRISM-X): input RGB; albedo, irradiance, normals, depth.
Dome Interior (PRISM → PRISM-X): input RGB; albedo, irradiance, normals, depth.
London (Marigold → Marigold-X): input RGB; albedo, irradiance.
Bridge (Marigold → Marigold-X): input RGB; albedo, irradiance.

Marigold IID Lighting v1.1 estimates albedo and irradiance only.

Additional Results

Surface normals by PRISM-X
Surface normals. Comparisons of PRISM-X with its base model PRISM and Marigold Normals v1.1 on NYUv2 and DIODE samples.
Depth maps by PRISM-X
Zero-shot depth. Comparisons of PRISM-X with PRISM and Marigold Depth v1.0 on NYUv2, DIODE and ETH3D. PRISM-X performs significantly better on challenging images.

Related Publications

BibTeX

@inproceedings{Dirik2025ReasonXMI,
  title     = {ReasonX: MLLM-Guided Intrinsic Image Decomposition},
  author    = {Dirik, Alara and Wang, Tuanfeng and Ceylan, Duygu
               and Zafeiriou, Stefanos and Fr{\"u}hst{\"u}ck, Anna},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}