MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

¹Seoul National University, ²NAVER Cloud AI, ³KAIST AI
* Equal Contribution    † Corresponding Author
ACL 2026 (Main)
Overall architecture of MM-JudgeBias
Overview of the MM-JudgeBias construction and evaluation pipeline. (a–b) We construct image sets from 29 source benchmarks covering four task types and 12 domain types, and then generate queries tailored to our bias evaluation setting through a comprehensive model-and-human verification framework to ensure high quality. (c) A text-only evaluation set is independently constructed via a parallel generation and verification process. (d–e) Based on nine defined bias types, we apply input perturbations to transform original unbiased samples—triplets (Q, I, R) or text-only pairs (Q, R)—into their corresponding biased variants (Q', I', R'). (f–g) Finally, we evaluate 26 MLLM judges using our proposed Bias-Deviation (BD) and Bias-Conformity (BC) metrics to systematically quantify the impact of Compositional Bias on their judgment reliability.

Introduction

Multimodal large language models (MLLMs) are increasingly deployed as automatic judges of multimodal generations, yet their reliability as evaluators remains underexplored. We show that many MLLM judges fail to reliably integrate visual and textual evidence, producing unstable or ungrounded scores. We systematically characterize this failure mode as Compositional Bias and introduce MM-JudgeBias, a benchmark that measures it across 9 bias types and 26 state-of-the-art MLLMs.

Beyond serving as task solvers, MLLMs have increasingly been adopted as automatic judges of multimodal generations, a paradigm known as MLLM-as-a-Judge. Given an input triplet of (Query, Image, Response), the judge must verify whether the response is grounded in the visual content and faithfully answers the query, which requires integrating all visual and textual cues. We observe, however, that current MLLM judges frequently fail to do so and instead rely on only a subset of the available cues. We formalize this failure mode as Compositional Bias: a systematic tendency of a judge to rely on partial, misaligned, or spurious compositions of the query, image, and response instead of correctly integrating and reasoning over all three.

MM-JudgeBias is a benchmark that quantitatively measures the compositional bias of MLLM judges. We inject controlled perturbations, each designed to elicit a specific form of compositional bias, into otherwise unbiased input samples, and quantify how the judge's score reacts using two complementary metrics: Bias-Deviation (BD) and Bias-Conformity (BC). The perturbations span three dimensions of judgment reliability: Integrality, Congruity, and Robustness. The dataset contains 1,804 samples weighted toward high difficulty (922 Hard, 646 Moderate, 236 Easy), drawn from 29 source benchmarks and covering 4 task types and 12 visual domains. We evaluate 26 state-of-the-art MLLMs (closed-source, open-source, and critic models) and find that compositional bias is pervasive, even in the most advanced reasoning-heavy models.

Bias Types

Compositional biases are grouped into three functional dimensions. Integrality and Congruity measure whether a judge penalizes missing or contradictory evidence (higher BD is better). Robustness measures whether a judge remains stable under semantic-preserving perturbations (higher BC is better).
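
To make the direction of the two metrics concrete, the sketch below scores a judge on paired (original, perturbed) ratings. It is an illustration only: this page does not give the formulas for BD and BC, so `deviation_rate`, `conformity_rate`, and the `tol` tolerance are hypothetical stand-ins, not the paper's definitions, and the ratings are made-up numbers.

```python
from typing import Sequence, Tuple

# (score on the original sample, score on the perturbed sample)
ScorePair = Tuple[float, float]


def deviation_rate(pairs: Sequence[ScorePair]) -> float:
    """Share of pairs where the judge lowers its score after the perturbation.

    Hypothetical BD-style stand-in (higher is better): when evidence is removed
    or contradicted, a reliable judge should penalize the response.
    """
    if not pairs:
        raise ValueError("need at least one score pair")
    return sum(1 for original, perturbed in pairs if perturbed < original) / len(pairs)


def conformity_rate(pairs: Sequence[ScorePair], tol: float = 0.0) -> float:
    """Share of pairs where the score stays within `tol` of the original.

    Hypothetical BC-style stand-in (higher is better): a semantic-preserving
    perturbation should leave the judgment unchanged.
    """
    if not pairs:
        raise ValueError("need at least one score pair")
    return sum(1 for original, perturbed in pairs if abs(perturbed - original) <= tol) / len(pairs)


# Toy ratings on a 1-10 scale (invented for illustration, not benchmark data).
null_image_pairs = [(9, 2), (8, 8), (7, 3), (9, 9)]
augmented_image_pairs = [(9, 9), (8, 7), (7, 7), (6, 6)]

print(deviation_rate(null_image_pairs))        # 0.5
print(conformity_rate(augmented_image_pairs))  # 0.75
```

In this toy setup the judge penalizes the missing image in only half of the cases, which is the kind of behavior a low Integrality score is meant to expose.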

| Dimension | Bias Type | Targeted Reliability | Perturbation Strategy | Metric |
|---|---|---|---|---|
| Integrality | Text-Dominance | Assesses over-reliance on linguistic cues when visual grounding is absent. | Replaces the image with a null image (black image). | BD ↑ |
| Integrality | Image-Dominance | Tests if the judge ignores the original query in favor of visual content. | Replaces the query with a null text (""). | BD ↑ |
| Integrality | Response-Dominance | Evaluates if scores are assigned based on response fluency alone, ignoring all context. | Replaces both query and image with null inputs. | BD ↑ |
| Congruity | Instruction-Misalignment | Probes query–response conflict by checking for semantic mismatch between the query and response. | Replaces the query with a random, unrelated sample. | BD ↑ |
| Congruity | Image-Misalignment | Probes image–response conflict by checking for semantic mismatch between the image and response. | Replaces the image with a random, unrelated sample. | BD ↑ |
| Robustness | Detail-Description | Examines if providing redundant, aligned visual information inflates the score. | Appends a detailed caption of the given image to the query. | BC ↑ |
| Robustness | Unnecessary-Image | Tests the ability to recognize and ignore irrelevant visual inputs in text-only tasks. | Adds a random, unrelated image to a text-only task. | BC ↑ |
| Robustness | Visual-Transformation | Measures invariance to low-level visual transformations that preserve semantics. | Applies a diverse set of semantic-preserving augmentations to the image. | BC ↑ |
| Robustness | Texture-Insertion | Probes over-sensitivity to textual cues embedded directly within the visual modality. | Overlays query-related keywords or the query text itself onto the image. | BC ↑ |
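
The Integrality and Congruity perturbations listed above can be sketched as small functions over a (query, image, response) sample. This is a minimal sketch under stated assumptions: `Sample`, `Image`, `black_like`, `pick_other`, and the function names are hypothetical; the image is modeled as raw RGB bytes so that no imaging library is needed; and since the page does not say how the "random, unrelated" sample is drawn, a seeded uniform choice among the other pool entries is assumed.

```python
import random
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class Image:
    """Toy image: raw RGB bytes, three per pixel (stand-in for a real image type)."""
    width: int
    height: int
    rgb: bytes


@dataclass(frozen=True)
class Sample:
    query: str
    image: Image
    response: str


def black_like(image: Image) -> Image:
    """Null image: all-zero (black) pixels with the same size as the input."""
    return Image(image.width, image.height, bytes(image.width * image.height * 3))


def text_dominance(s: Sample) -> Sample:
    return replace(s, image=black_like(s.image))             # image -> black image


def image_dominance(s: Sample) -> Sample:
    return replace(s, query="")                              # query -> ""


def response_dominance(s: Sample) -> Sample:
    return replace(s, query="", image=black_like(s.image))   # both inputs nulled


def pick_other(s: Sample, pool: Sequence[Sample], rng: random.Random) -> Sample:
    """Assumed sampling rule: uniform choice among the other samples in the pool."""
    return rng.choice([p for p in pool if p is not s])


def instruction_misalignment(s: Sample, pool: Sequence[Sample], rng: random.Random) -> Sample:
    return replace(s, query=pick_other(s, pool, rng).query)  # unrelated query


def image_misalignment(s: Sample, pool: Sequence[Sample], rng: random.Random) -> Sample:
    return replace(s, image=pick_other(s, pool, rng).image)  # unrelated image


# Two invented samples; with a pool of two, the "other" sample is unambiguous.
pool = [
    Sample("What color is the bus?", Image(2, 2, bytes([255] * 12)), "The bus is red."),
    Sample("How many rows does the table have?", Image(1, 1, bytes([128] * 3)), "Five rows."),
]
rng = random.Random(0)
original = pool[0]

print(text_dominance(original).image.rgb == bytes(12))      # True
print(repr(image_dominance(original).query))                # ''
print(instruction_misalignment(original, pool, rng).query)  # How many rows does the table have?
```

In every variant the response is left untouched, so any change in the judge's score can be attributed to the perturbed query or image.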

Statistics

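| Task Type | Domain Type | #Samples |
|---|---|---|
| Perception / Understanding | General | 193 |
| Perception / Understanding | Spatial / Layout / Geometry | 94 |
| Perception / Understanding | Subtotal | 287 |
| Information Extraction | Text | 223 |
| Information Extraction | Chart / Plot / Diagram | 150 |
| Information Extraction | Table | 172 |
| Information Extraction | Web / App / UI | 89 |
| Information Extraction | Subtotal | 634 |
| Knowledge | Factual / Commonsense | 96 |
| Knowledge | Domain Knowledge | 250 |
| Knowledge | Subtotal | 346 |
| Reasoning | Causal / Logical | 96 |
| Reasoning | Math | 176 |
| Reasoning | Code / Symbolic | 77 |
| Reasoning | Exam | 188 |
| Reasoning | Subtotal | 537 |
| Total | | 1,804 |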

Hierarchical composition of the MM-JudgeBias dataset. 1,804 samples are distributed across four functional task types and twelve visual domains.

| Task Type | Domain | Source Benchmarks | Description |
|---|---|---|---|
| Perception / Understanding | General | COCO2017-val | Recognition and understanding of objects, attributes, and overall scene content in natural images. |
| Perception / Understanding | Spatial / Layout | SUN397, Places365, GeoQA+, Geometry3K, UniGeo | Understanding of spatial relations, geometric structures, and layout arrangements in visual scenes. |
| Information Extraction | Text | TextVQA, DocVQA, CoSyn | Reading and interpretation of textual information embedded in images and documents. |
| Information Extraction | Chart / Plot | ChartQAPro, ChartBench, AI2D, TQA, InfographicVQA, CoSyn | Interpretation of structured visualizations such as charts, plots, diagrams, and infographics. |
| Information Extraction | Table | MMTab, TabRecSet, DocVQA, CoSyn | Parsing, retrieval, and reasoning over tabular and hierarchically structured visual information. |
| Information Extraction | Web / App / UI | ScreenSpot, Rico | Understanding of user interfaces, interface elements, and their functional roles in digital environments. |
| Knowledge | Factual / Commonsense | A-OKVQA, VCR | Factual knowledge and commonsense reasoning beyond directly observable visual evidence. |
| Knowledge | Domain-Specific | ScienceQA, VQA-RAD, ArtBench | Specialized knowledge in domains such as science, medicine, and fine arts. |
| Reasoning | Causal / Logical | COCO2017-val | Causal relations, logical consistency, and coherent reasoning grounded in visual context. |
| Reasoning | Math | MathVision, MathVista, CoSyn | Mathematical problems, calculations, and visual elements such as diagrams, symbols, or formulas. |
| Reasoning | Code / Symbolic | MMCode, Plot2Code, MMEval | Transformations between visual inputs and code or symbolic representations. |
| Reasoning | Exam | MMMU-Pro | Images containing exam-style queries or problem statements. |

Composition of the MM-JudgeBias dataset. Summary of four functional task types and twelve semantic domains curated from 29 source benchmarks.

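| Bias Type | Easy | Moderate | Hard | Total |
|---|---|---|---|---|
| Text-Dominance | 28 | 79 | 98 | 205 |
| Image-Dominance | 25 | 74 | 105 | 204 |
| Response-Dominance | 23 | 68 | 113 | 204 |
| Instruction-Misalignment | 27 | 71 | 106 | 204 |
| Image-Misalignment | 27 | 71 | 104 | 202 |
| Detail-Description | 21 | 76 | 103 | 200 |
| Unnecessary-Image | 22 | 80 | 95 | 197 |
| Visual-Transformation | 29 | 70 | 96 | 195 |
| Texture-Insertion | 34 | 57 | 102 | 193 |
| Total | 236 | 646 | 922 | 1,804 |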

Dataset statistics by bias type and difficulty. Distribution of the 1,804 samples across nine bias-induction strategies and three cognitive difficulty levels (Easy, Moderate, Hard).
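
The totals in this table can be rechecked, and the difficulty mix expressed as shares, with a few lines of Python. The counts are copied from the table as (Easy, Moderate, Hard) per bias type; the variable names are illustrative only.

```python
# (Easy, Moderate, Hard) sample counts per bias type, copied from the table.
counts = {
    "Text-Dominance": (28, 79, 98),
    "Image-Dominance": (25, 74, 105),
    "Response-Dominance": (23, 68, 113),
    "Instruction-Misalignment": (27, 71, 106),
    "Image-Misalignment": (27, 71, 104),
    "Detail-Description": (21, 76, 103),
    "Unnecessary-Image": (22, 80, 95),
    "Visual-Transformation": (29, 70, 96),
    "Texture-Insertion": (34, 57, 102),
}

# Column sums: transpose the rows and add up each difficulty level.
easy, moderate, hard = (sum(column) for column in zip(*counts.values()))
total = easy + moderate + hard
print(easy, moderate, hard, total)  # 236 646 922 1804

# Share of each difficulty level in the whole dataset.
for level, n in (("Easy", easy), ("Moderate", moderate), ("Hard", hard)):
    print(f"{level}: {n / total:.1%}")
# Easy: 13.1%
# Moderate: 35.8%
# Hard: 51.1%
```

Roughly half of the samples fall in the Hard bucket, and each bias type contributes between 193 and 205 samples.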

Leaderboard

Main experimental results on MM-JudgeBias. The "Avg." column averages across all nine bias types. "Think" indicates models running with internal reasoning; "(high)" denotes high reasoning-effort. Three runs are averaged; we also report inter-run and inter-sample variance. Bold values mark the best per column.

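Column groups: Integrality (BD ↑): TextDom, ImgDom, RespDom. Congruity (BD ↑): InstrMis, ImgMis. Robustness (BC ↑): DetailDesc, UnnecImg, VisualTrans, TextInsert.

| Category | Model | Think | TextDom | ImgDom | RespDom | InstrMis | ImgMis | DetailDesc | UnnecImg | VisualTrans | TextInsert | Avg. | Var. (Inter-run) | Var. (Inter-sample) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source | Gemini-3-Pro (high) | | 0.912 | 0.278 | 0.988 | 0.982 | 0.955 | 0.936 | 0.947 | 0.889 | 0.933 | 0.869 | 0.6 | 9.3 |
| Closed-source | Gemini-2.5-Pro | | 0.751 | 0.535 | 0.978 | 1.000 | 0.904 | 0.935 | 0.924 | 0.861 | 0.930 | 0.869 | 0.5 | 8.7 |
| Closed-source | Gemini-2.5-Flash | | 0.423 | 0.292 | 0.743 | 0.997 | 0.760 | 0.905 | 0.944 | 0.864 | 0.918 | 0.761 | 0.6 | 7.0 |
| Closed-source | Gemini-2.5-Flash-Lite (think) | | 0.391 | 0.335 | 0.890 | 0.993 | 0.634 | 0.866 | 0.940 | 0.801 | 0.875 | 0.747 | 1.1 | 7.9 |
| Closed-source | Gemini-2.5-Flash-Lite | | 0.113 | 0.367 | 0.544 | 0.978 | 0.585 | 0.845 | 0.937 | 0.787 | 0.873 | 0.670 | 1.3 | 6.5 |
| Closed-source | Gemini-2.0-Flash-Lite | | 0.162 | 0.358 | 0.570 | 0.924 | 0.217 | 0.897 | 0.958 | 0.862 | 0.878 | 0.647 | 0.5 | 4.5 |
| Closed-source | o3 (high) | | 0.276 | 0.354 | 0.596 | 0.986 | 0.337 | 0.879 | 0.937 | 0.822 | 0.884 | 0.675 | 0.7 | 6.3 |
| Closed-source | o4-mini (high) | | 0.141 | 0.446 | 0.518 | 0.997 | 0.184 | 0.886 | 0.950 | 0.854 | 0.909 | 0.654 | 0.9 | 8.5 |
| Closed-source | GPT-5.1 (high) | | 0.192 | 0.211 | 0.200 | 1.000 | 0.296 | 0.911 | 0.961 | 0.867 | 0.908 | 0.616 | 0.6 | 7.3 |
| Closed-source | GPT-5 mini (high) | | 0.112 | 0.236 | 0.337 | 0.991 | 0.185 | 0.898 | 0.954 | 0.877 | 0.927 | 0.613 | 0.4 | 5.8 |
| Closed-source | GPT-4.1 mini | | 0.049 | 0.148 | 0.138 | 0.992 | 0.160 | 0.882 | 0.972 | 0.887 | 0.904 | 0.570 | 0.5 | 4.0 |
| Closed-source | Claude-Opus-4.5 (think) | | 0.680 | 0.589 | 0.917 | 0.987 | 0.959 | 0.917 | 0.925 | 0.842 | 0.903 | 0.858 | 0.2 | 5.3 |
| Closed-source | Claude-Sonnet-4.5 (think) | | 0.556 | 0.540 | 0.831 | 0.987 | 0.905 | 0.884 | 0.921 | 0.825 | 0.893 | 0.816 | 0.5 | 5.6 |
| Closed-source | Claude-Haiku-4.5 (think) | | 0.291 | 0.530 | 0.735 | 0.981 | 0.809 | 0.870 | 0.943 | 0.785 | 0.869 | 0.757 | 0.6 | 5.7 |
| Closed-source | Claude-Opus-4.5 | | 0.539 | 0.571 | 0.838 | 0.990 | 0.973 | 0.902 | 0.814 | 0.820 | 0.891 | 0.815 | 0.2 | 5.5 |
| Closed-source | Claude-Sonnet-4.5 | | 0.445 | 0.385 | 0.716 | 0.982 | 0.862 | 0.837 | 0.834 | 0.798 | 0.866 | 0.747 | 0.5 | 6.3 |
| Closed-source | Claude-Haiku-4.5 | | 0.278 | 0.370 | 0.603 | 0.981 | 0.818 | 0.827 | 0.888 | 0.747 | 0.844 | 0.706 | 0.7 | 6.3 |
| Open-source | Qwen3-VL-30B-A3B-Thinking | | 0.293 | 0.343 | 0.806 | 0.983 | 0.476 | 0.886 | 0.905 | 0.850 | 0.874 | 0.713 | 0.9 | 6.7 |
| Open-source | Qwen3-VL-8B-Thinking | | 0.177 | 0.210 | 0.412 | 0.991 | 0.529 | 0.883 | 0.927 | 0.865 | 0.898 | 0.655 | 0.9 | 6.2 |
| Open-source | Qwen3-VL-30B-A3B-Instruct | | 0.237 | 0.174 | 0.642 | 0.988 | 0.648 | 0.840 | 0.837 | 0.854 | 0.883 | 0.678 | 0.6 | 4.6 |
| Open-source | Qwen3-VL-8B-Instruct | | 0.336 | 0.266 | 0.913 | 1.000 | 0.622 | 0.803 | 0.952 | 0.870 | 0.880 | 0.738 | 0.8 | 7.3 |
| Open-source | Qwen2.5-VL-72B-Instruct | | 0.082 | 0.158 | 0.223 | 0.989 | 0.208 | 0.822 | 0.903 | 0.855 | 0.853 | 0.566 | 0.6 | 2.3 |
| Open-source | Qwen2.5-VL-7B-Instruct | | 0.141 | 0.188 | 0.194 | 0.991 | 0.277 | 0.735 | 0.815 | 0.739 | 0.767 | 0.539 | 1.7 | 4.6 |
| Open-source | InternVL3.5-30B-A3B | | 0.073 | 0.225 | 0.179 | 0.964 | 0.377 | 0.800 | 0.837 | 0.801 | 0.804 | 0.562 | 1.4 | 4.1 |
| Open-source | InternVL3.5-14B | | 0.137 | 0.243 | 0.273 | 0.982 | 0.464 | 0.803 | 0.850 | 0.797 | 0.814 | 0.596 | 1.0 | 3.7 |
| Open-source | InternVL3.5-8B | | 0.099 | 0.178 | 0.215 | 0.926 | 0.289 | 0.798 | 0.886 | 0.791 | 0.814 | 0.555 | 1.0 | 3.5 |
| Critic | Prometheus-Vision-13B | | 0.163 | 0.340 | 0.362 | 0.890 | 0.166 | 0.738 | 0.781 | 0.804 | 0.818 | 0.563 | 2.4 | 10.3 |
| Critic | Prometheus-Vision-7B | | 0.167 | 0.242 | 0.246 | 0.869 | 0.165 | 0.750 | 0.793 | 0.821 | 0.806 | 0.540 | 2.4 | 10.6 |
| Critic | LLaVA-Critic-72B | | 0.147 | 0.121 | 0.373 | 0.989 | 0.250 | 0.926 | 0.974 | 0.931 | 0.942 | 0.628 | 0.0 | 2.6 |
| Critic | LLaVA-Critic-7B | | 0.238 | 0.266 | 0.420 | 0.958 | 0.452 | 0.824 | 0.929 | 0.869 | 0.864 | 0.647 | 0.0 | 6.1 |
| Average | | | 0.287 | 0.317 | 0.547 | 0.976 | 0.516 | 0.856 | 0.905 | 0.835 | 0.874 | 0.679 | 0.8 | 6.1 |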
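
As a quick consistency check on how the "Avg." column is formed, the snippet below recomputes it for two rows as the unweighted mean of the nine per-bias scores. The scores and reported averages are copied from the leaderboard; the per-dimension means are a convenience computed here and are not numbers reported on this page, and all names in the code are illustrative.

```python
from statistics import mean

# Column order: TextDom, ImgDom, RespDom | InstrMis, ImgMis |
#               DetailDesc, UnnecImg, VisualTrans, TextInsert
DIMENSIONS = {
    "Integrality": slice(0, 3),  # BD, higher is better
    "Congruity": slice(3, 5),    # BD, higher is better
    "Robustness": slice(5, 9),   # BC, higher is better
}

# model -> (nine per-bias scores, reported "Avg.")
rows = {
    "Gemini-3-Pro (high)": (
        [0.912, 0.278, 0.988, 0.982, 0.955, 0.936, 0.947, 0.889, 0.933], 0.869),
    "GPT-4.1 mini": (
        [0.049, 0.148, 0.138, 0.992, 0.160, 0.882, 0.972, 0.887, 0.904], 0.570),
}

for model, (scores, reported_avg) in rows.items():
    recomputed = mean(scores)
    # The table rounds to three decimals, so allow half a unit in the last place.
    matches = abs(recomputed - reported_avg) < 0.0005
    per_dimension = {name: mean(scores[span]) for name, span in DIMENSIONS.items()}
    print(f"{model}: avg={recomputed:.4f} (reported {reported_avg:.3f}, match={matches})")
    for name, value in per_dimension.items():
        print(f"  {name}: {value:.4f}")
```

Both rows score close to 1.0 on Instruction-Misalignment and Robustness, so the gap in their averages comes almost entirely from the Integrality columns and Image-Misalignment.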

Per-Bias Performance

Each panel shows the score of every evaluated judge on a single bias type. Panels: Text-Dominance, Image-Dominance, Response-Dominance, Instruction-Misalignment, Image-Misalignment, Detail-Description, Unnecessary-Image, Visual-Transformation, Texture-Insertion.

Examples

For each of the nine bias types, we show an unbiased instance and its perturbed counterpart. The shared response is rated by judges in both contexts.

Judge Examples

Qualitative examples showing how a single judge model assigns scores under unbiased and biased conditions, illustrating each compositional-bias dimension.

BibTeX

@inproceedings{lee2026mmjudgebias,
  title = "MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge",
  author = "Lee, Sua and Park, Sanghee and Im, Jinbae",
  booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2026",
  publisher = "Association for Computational Linguistics",
}