THEMIS Project

Abstract

We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, and the diversity and difficulty of these manipulations demand a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

Findings

01

Overall performance remains limited. Across all evaluated tasks, MLLMs exhibit significant room for improvement, with even the SOTA GPT-5 reaching a peak BRI of only 56.15%. Notably, the open-source Qwen2.5-VL-72B achieved 47.16%, performing on par with several proprietary models and highlighting the competitive potential of open-weight architectures.

02

Models exhibit pronounced specialization. Most MLLMs excel in one or two subtasks but fall short in others, revealing substantial imbalance. This specialization highlights the lack of integrated visual fraud reasoning capabilities in current models.

03

Localization is markedly harder. Performance declined for every MLLM when moving from the Single-Mode Forgery Identification task to the Localization task, with GPT-5 dropping by 55% and OpenAI o4-mini-high by 61%. By contrast, Gemini 2.5 Flash declined by only 14%, indicating relatively stronger spatial perception.

04

Limitations in cross-modal alignment. Although models demonstrate relatively high judgment accuracy on the Text–Image Inconsistency subtask, they struggle to ground these judgments in specific textual spans. This limitation may stem from an overreliance on global semantic associations combined with insufficient modeling of local cross-modal mappings.
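To make "grounding a judgment in a textual span" concrete, the sketch below scores a predicted character span against a reference span using intersection-over-union. This is only one common way to measure span overlap; the page does not state which grounding metric THEMIS uses, and the function name `span_iou` and the half-open `(start, end)` convention are assumptions made for illustration.

```python
from typing import Tuple

Span = Tuple[int, int]  # half-open character span: (start, end)


def span_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two character spans (assumed metric)."""
    # Overlap length is zero when the spans do not intersect.
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union else 0.0


# A model that flags a whole 60-character sentence when only an
# 11-character phrase is inconsistent is "right" at the sentence level
# but scores poorly on span overlap.
whole_sentence = (0, 60)
inconsistent_phrase = (30, 41)
print(span_iou(whole_sentence, inconsistent_phrase))
```

Under a metric of this kind, a correct yes/no judgment and a poorly localized span can coexist, which is the gap this finding describes.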

01

Lack of transformation sensitivity. Models perform reasonably on direct reuse but degrade markedly under geometric transformations or appearance adjustments. They often fail to detect duplication and cannot determine the transformation type, underscoring the limited spatial reasoning ability of current MLLMs and their insufficient robustness to transformations.
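As a rough illustration of what this finding asks of a model, the sketch below checks whether one panel is a direct reuse (DR), horizontal flip (HF), or vertical flip (VF) of another by undoing each candidate transformation and comparing pixels. It is a toy baseline on small grayscale grids, not the THEMIS evaluation procedure; the helper names and the tolerance `tol` are assumptions.

```python
from typing import List, Optional

Grid = List[List[int]]  # toy grayscale panel: rows of pixel values


def hflip(img: Grid) -> Grid:
    """Mirror left-to-right (HF)."""
    return [row[::-1] for row in img]


def vflip(img: Grid) -> Grid:
    """Mirror top-to-bottom (VF)."""
    return [row[:] for row in img[::-1]]


def mean_abs_diff(a: Grid, b: Grid) -> float:
    """Average per-pixel absolute difference; assumes equal shapes."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def detect_reuse(panel_a: Grid, panel_b: Grid, tol: float = 2.0) -> Optional[str]:
    """Return 'DR', 'HF', or 'VF' for the first match within tol, else None."""
    # Flips are their own inverse, so flipping panel_b again undoes them.
    candidates = {"DR": panel_b, "HF": hflip(panel_b), "VF": vflip(panel_b)}
    for label, cand in candidates.items():
        if mean_abs_diff(panel_a, cand) <= tol:
            return label
    return None


original = [[10, 20, 30], [40, 50, 60]]
print(detect_reuse(original, hflip(original)))
```

The small tolerance lets the check survive mild appearance adjustments such as a uniform brightness shift. Symmetric panels are inherently ambiguous: the function then reports the first matching label.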

02

Current models lack robustness to input perturbations. To test MLLM sensitivity on synthetic data, we applied Gaussian blur, JPEG compression, and scaling. Gaussian blur caused the steepest drops across tasks, while JPEG compression and scaling also degraded performance.
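The perturbations named above can be sketched on a toy grayscale grid using only the standard library. The code below implements a separable Gaussian blur and a nearest-neighbor rescale; JPEG compression is omitted because it requires an image codec. The parameter names (`sigma`, `factor`) and the border handling are assumptions, since the page does not give the settings used in the experiments.

```python
import math
from typing import List

Grid = List[List[float]]  # toy grayscale image


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img: Grid, kernel: List[float]) -> Grid:
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = min(max(x + k - radius, 0), width - 1)
                acc += w * row[src]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img: Grid) -> Grid:
    return [list(col) for col in zip(*img)]


def gaussian_blur(img: Grid, sigma: float) -> Grid:
    """Separable blur: filter rows, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))


def rescale(img: Grid, factor: float) -> Grid:
    """Nearest-neighbor resize by the given factor."""
    h, w = len(img), len(img[0])
    new_h, new_w = max(1, round(h * factor)), max(1, round(w * factor))
    return [[img[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))]
             for x in range(new_w)] for y in range(new_h)]


# A single bright pixel is spread out by the blur, which is the kind of
# fine detail that manipulation traces often depend on.
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 100.0
print(gaussian_blur(impulse, sigma=1.0)[3][3])
```

In a robustness study of this kind, each perturbed image would be fed to the model alongside its clean counterpart and the per-task scores compared.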

03

Insufficient edge perception. Models perform better on Copy-Move than Splicing, as the former can exploit both edge anomalies and region similarity, while the latter depends mainly on boundary cues.
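The asymmetry between the two forgery types can be illustrated on toy grids. In the sketch below, copy-move pastes a patch from the same image, so an exact-match search finds a duplicated region, whereas splicing pastes a patch from a different image and leaves no such internal duplicate. All function names are hypothetical, and the exact-match search stands in for the far more tolerant similarity cues a real detector or MLLM would need.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]


def crop(img: Grid, top: int, left: int, h: int, w: int) -> Grid:
    return [row[left:left + w] for row in img[top:top + h]]


def paste(img: Grid, patch: Grid, top: int, left: int) -> Grid:
    """Return a copy of img with patch written at (top, left); must fit."""
    out = [row[:] for row in img]
    for dy, patch_row in enumerate(patch):
        out[top + dy][left:left + len(patch_row)] = patch_row
    return out


def copy_move(img: Grid, src: Tuple[int, int], size: Tuple[int, int],
              dst: Tuple[int, int]) -> Grid:
    """Copy a region of img to another location in the same image."""
    return paste(img, crop(img, src[0], src[1], size[0], size[1]), dst[0], dst[1])


def splice(host: Grid, donor: Grid, src: Tuple[int, int],
           size: Tuple[int, int], dst: Tuple[int, int]) -> Grid:
    """Paste a region taken from a different image into host."""
    return paste(host, crop(donor, src[0], src[1], size[0], size[1]), dst[0], dst[1])


def has_duplicate_region(img: Grid, h: int, w: int) -> bool:
    """True if two non-overlapping h-by-w windows are pixel-identical."""
    seen: Dict[tuple, List[Tuple[int, int]]] = {}
    for top in range(len(img) - h + 1):
        for left in range(len(img[0]) - w + 1):
            key = tuple(tuple(r) for r in crop(img, top, left, h, w))
            for t0, l0 in seen.get(key, []):
                if abs(top - t0) >= h or abs(left - l0) >= w:
                    return True
            seen.setdefault(key, []).append((top, left))
    return False


host = [[r * 6 + c for c in range(6)] for r in range(6)]
donor = [[100 + r * 6 + c for c in range(6)] for r in range(6)]
moved = copy_move(host, (0, 0), (2, 2), (4, 4))
spliced = splice(host, donor, (0, 0), (2, 2), (4, 4))
print(has_duplicate_region(moved, 2, 2), has_duplicate_region(spliced, 2, 2))
```

With the region-similarity cue unavailable, a spliced patch can only be exposed through boundary or statistical inconsistencies, which matches the explanation given for the weaker Splicing results.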

04

Synthetic fraud data exhibit a level of deceptiveness comparable to real fraud data. In the Composite Manipulation Operations Identification task, the synthetic data impose even greater identification pressure than real data. In the Single-Mode Forgery Identification and Localization tasks, their difficulty is on par with real data. The main exception is the Splicing subtask, where real data remain more challenging due to the sophistication and subtlety of splicing patterns in real-world manipulations.

Benchmark

Dataset construction pipeline of THEMIS. The dataset is built in two stages. Stage 1, Extraction and Parsing: figures, captions, and related sentences are parsed from scientific PDFs and segmented into panels. Stage 2, Fraud Data Generation: 5 major fraud types (Splicing, Copy-Move, AI-Generated, Duplication, and Text–Image Inconsistency) are applied to construct challenging tasks.

Evaluation task design of THEMIS. A principled mapping from 5 fraud types to 5 core reasoning capabilities (Expert Knowledge Utilization, Visual Recognition, Spatial Reasoning, Region Localization, and Comparative Reasoning). The capability distribution bars on the right of each box illustrate the reasoning skills involved and their relative emphasis, with the darkest color highlighting the primary capability being evaluated.

Statistics of THEMIS. (a) Distribution of fraud methods. (b) Distribution of manipulation operations (synthetic data). See Table 5 for real cases. IIF: Image Inference Forgery; TRE: Targeted Region Editing; CT: Color Temperature; DR: Direct Reuse; HF: Horizontal Flip; VF: Vertical Flip.

Can You Spot the Fraud?

C) Are the text and image consistent?

Figure Caption:

EDS analysis at the grain boundaries of Fe-6.5 wt % Si steel strip samples: (a) 1.0 wt % Cu, (b) 1.5 wt % Cu, and (c) 2.0 wt % Cu.

Related Sentences:

Further SEM microstructural examination revealed that Cu rich precipitates were absent in the 1.5 wt % Cu specimen (Figure 8a), but became visible in the 1.5 wt % Cu specimen (Figure 8b). The precipitates in the 1.5 wt % Cu specimen were tiny, few, and irregularly scattered at grain boundaries. Cu-rich precipitates were continuous or semicontinuous at grain boundaries as the Cu dosage was raised to 2.0 wt %, as shown in Figure 8c.

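Seeking Summer Internships

Tzu-Yen Ma (麻紫嫣)

Beijing University of Posts and Telecommunications (BUPT)

Master's student, class of 2027

Target roles: multimodal reasoning / evaluation / data synthesis / agent algorithm engineer

maziyan@bupt.edu.cn

Bo Zhang (张博)

Beijing University of Posts and Telecommunications (BUPT)

Master's student, class of 2027

Target roles: multimodal reasoning / evaluation / data synthesis / agent algorithm engineer

zhangbol@bupt.edu.cn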