THEMIS

Towards Holistic Evaluation of MLLMs
for Scientific Paper Fraud Forensics

Authors: Tzu-Yen Ma*, Bo Zhang*, Zichen Tang, Junpeng Ding, Haolin Tian, Yuanze Li, Zhuodi Hao, Zixin Ding, Zirui Wang, Xinyu Yu, Shiyao Peng, Yizhuo Zhao, Ruomeng Jiang, Yiling Huang, Peizhi Zhao, Jiayuan Chen, Weisheng Tan, Haocheng Gao, Yang Liu, Jiacheng Liu, Zhongjun Yang, Jiayu Huang, Haihong E

Affiliation: Beijing University of Posts and Telecommunications

* Equal Contribution    † Corresponding Author

THEMIS Overview

Abstract

We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, and the diversity and difficulty of these manipulations demand a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

Findings

Main Findings Visual
01
Overall performance remains limited. Across all evaluated tasks, MLLMs exhibit significant room for improvement, with even the SOTA GPT-5 reaching a peak BRI of only 56.15%. Notably, the open-source Qwen2.5-VL-72B achieved 47.16%, performing on par with several proprietary models and highlighting the competitive potential of open-weight architectures.
02
Models exhibit pronounced specialization. Most MLLMs excel in one or two subtasks but fall short in others, revealing substantial imbalance. This specialization highlights the lack of integrated visual fraud reasoning capabilities in current models.
03
Localization is markedly harder. Performance declined consistently across all MLLMs when moving from the Single-Mode Forgery Identification task to the Localization task, with GPT-5 dropping by 55% and OpenAI o4-mini-high by 61%. By contrast, Gemini 2.5 Flash showed only a minor decline (14%), indicating relatively stronger spatial perceptual capacity.
04
Limitations in cross-modal alignment. Although models achieve relatively high judgment accuracy on the Text–Image Inconsistency subtask, they struggle to ground these judgments in specific textual spans. This limitation may stem from an overreliance on global semantic associations combined with insufficient modeling of local cross-modal mappings.
Visual 1
Visual 2
01
Lack of transformation sensitivity. Models perform reasonably on direct reuse but degrade markedly under geometric transformations or appearance adjustments. They often fail to detect duplication and cannot determine the transformation type, underscoring the limited spatial reasoning ability of current MLLMs and their insufficient robustness to transformations.
02
Current models lack robustness to input perturbations. To test MLLM sensitivity on synthetic data, we applied Gaussian blur, JPEG compression, and scaling. Gaussian blur caused the steepest drops across tasks, while JPEG compression and scaling also degraded performance.
03
Insufficient edge perception. Models perform better on Copy-Move than Splicing, as the former can exploit both edge anomalies and region similarity, while the latter depends mainly on boundary cues.
04
Synthetic fraud data are about as deceptive as real fraud data. In the Composite Manipulation Operations Identification task, the synthetic data impose even greater identification pressure than real data. In the Single-Mode Forgery Identification and Localization tasks, the synthetic data are on par with real data in difficulty. The main exception is the Splicing subtask, where real data remain more challenging because of the sophistication and subtlety of splicing patterns in real-world manipulations.
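
The input-perturbation test mentioned in the robustness finding above can be illustrated with a minimal sketch. It covers two of the three named perturbations (Gaussian blur and scaling) on a grayscale image stored as a nested list. JPEG compression is left out because it needs an image codec. The function names, the default `sigma`, and the nearest-neighbour resampling are assumptions made for illustration, not the settings used in THEMIS.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def _transpose(image):
    """Swap rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur: filter the rows, then the columns."""
    radius = max(1, int(3 * sigma))  # assumed cut-off of about 3 sigma
    kernel = gaussian_kernel(sigma, radius)
    horizontal = _blur_rows(image, kernel)
    return _transpose(_blur_rows(_transpose(horizontal), kernel))


def rescale(image, factor):
    """Nearest-neighbour rescaling by a positive factor (assumed resampler)."""
    height, width = len(image), len(image[0])
    new_h = max(1, round(height * factor))
    new_w = max(1, round(width * factor))
    return [
        [image[min(int(y / factor), height - 1)][min(int(x / factor), width - 1)]
         for x in range(new_w)]
        for y in range(new_h)
    ]
```

In a robustness check of this kind, each perturbed image would be sent to the model with the original question, and the score compared against the unperturbed run.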

Benchmark

Construction Pipeline

Dataset construction pipeline of THEMIS. The dataset is built in two stages. Stage 1, Extraction and Parsing: figures, captions, and related sentences are parsed from scientific PDFs, and figures are segmented into panels. Stage 2, Fraud Data Generation: five major fraud types (Splicing, Copy-Move, AI-Generated, Duplication, and Text–Image Inconsistency) are applied to construct challenging tasks.
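
To make the difference between three of these fraud types concrete, here is a toy sketch on images stored as nested lists. It is not the THEMIS generation pipeline, which the page does not specify. The function names and the `(top, left, height, width)` box convention are assumptions. The duplication codes DR, HF, and VF follow the abbreviations used in the statistics caption.

```python
def crop(image, top, left, height, width):
    """Return the rectangular region as a new nested list."""
    return [row[left:left + width] for row in image[top:top + height]]


def paste(image, patch, top, left):
    """Return a copy of image with patch written at (top, left)."""
    patch_h, patch_w = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + patch_h > len(image) or left + patch_w > len(image[0]):
        raise ValueError("patch does not fit inside the image")
    out = [row[:] for row in image]
    for dy, patch_row in enumerate(patch):
        out[top + dy][left:left + patch_w] = patch_row
    return out


def copy_move(image, src_box, dst_top, dst_left):
    """Copy-Move: the pasted region comes from the same image."""
    return paste(image, crop(image, *src_box), dst_top, dst_left)


def splice(target, donor, src_box, dst_top, dst_left):
    """Splicing: the pasted region comes from a different (donor) image."""
    return paste(target, crop(donor, *src_box), dst_top, dst_left)


def duplicate(panel, operation):
    """Duplication: reuse a whole panel, optionally flipped."""
    if operation == "DR":  # Direct Reuse
        return [row[:] for row in panel]
    if operation == "HF":  # Horizontal Flip
        return [row[::-1] for row in panel]
    if operation == "VF":  # Vertical Flip
        return [row[:] for row in panel[::-1]]
    raise ValueError(f"unknown duplication operation: {operation}")
```

For example, `copy_move(img, (0, 0, 2, 2), 2, 2)` copies the top-left 2x2 block of `img` to position (2, 2) of the same image, whereas `splice` takes the block from a second image, so only the pasted boundary can reveal the edit.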

QA Design Examples

Evaluation task design of THEMIS. The tasks follow a principled mapping from 5 fraud types to 5 core reasoning capabilities (Expert Knowledge Utilization, Visual Recognition, Spatial Reasoning, Region Localization, and Comparative Reasoning). The capability distribution bars on the right of each box show the reasoning skills involved and their relative emphasis, with the darkest color marking the primary capability being evaluated.
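
One way to turn such a mapping into per-capability scores is an emphasis-weighted average of task accuracies. The sketch below is an assumption about how this could be done, not the scoring rule used in THEMIS. The page does not give the emphasis weights, so the caller must supply them, and `capability_scores` is a hypothetical helper name.

```python
def capability_scores(task_accuracy, emphasis):
    """Aggregate task accuracies into per-capability scores.

    task_accuracy: {fraud_type: accuracy in [0, 1]}
    emphasis:      {fraud_type: {capability: non-negative weight}}
                   (hypothetical weights; not published on this page)
    Returns {capability: weighted mean accuracy}.
    """
    weighted_sum = {}
    weight_total = {}
    for fraud_type, accuracy in task_accuracy.items():
        for capability, weight in emphasis.get(fraud_type, {}).items():
            weighted_sum[capability] = weighted_sum.get(capability, 0.0) + weight * accuracy
            weight_total[capability] = weight_total.get(capability, 0.0) + weight
    return {
        capability: weighted_sum[capability] / weight_total[capability]
        for capability in weighted_sum
        if weight_total[capability] > 0
    }
```

With this scheme, a capability that is the primary emphasis of a weak task pulls that capability's score down more than a capability that the task only touches lightly.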

Benchmark Data Statistics

Statistics of THEMIS. (a) Distribution of fraud methods. (b) Distribution of manipulation operations (synthetic data). See Table 5 for real cases. IIF: Image Inference Forgery; TRE: Targeted Region Editing; CT: Color Temperature; DR: Direct Reuse; HF: Horizontal Flip; VF: Vertical Flip.


Can You Spot the Fraud?

A) Identify the forgery type (Cases 1–3; Duplication Cases 1–2; images not reproduced here).

C) Are the text and image consistent? (TII Case)

Figure Caption:

EDS analysis at the grain boundaries of Fe-6.5 wt % Si steel strip samples: (a) 1.0 wt % Cu, (b) 1.5 wt % Cu, and (c) 2.0 wt % Cu.

Related Sentences:

Further SEM microstructural examination revealed that Cu rich precipitates were absent in the 1.5 wt % Cu specimen (Figure 8a), but became visible in the 1.5 wt % Cu specimen (Figure 8b). The precipitates in the 1.5 wt % Cu specimen were tiny, few, and irregularly scattered at grain boundaries. Cu-rich precipitates were continuous or semicontinuous at grain boundaries as the Cu dosage was raised to 2.0 wt %, as shown in Figure 8c.
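
A toy rule-based check shows what grounding the inconsistency in a specific textual span means for this case. It is a sketch under strong assumptions: the regular expressions are written for this one caption style (panel letter followed by a `wt %` value) and are not a general method, and `find_mismatches` is a hypothetical helper name. The two strings are copied from the case above.

```python
import re

CAPTION = ("EDS analysis at the grain boundaries of Fe-6.5 wt % Si steel strip "
           "samples: (a) 1.0 wt % Cu, (b) 1.5 wt % Cu, and (c) 2.0 wt % Cu.")

SENTENCES = ("Further SEM microstructural examination revealed that Cu rich "
             "precipitates were absent in the 1.5 wt % Cu specimen (Figure 8a), "
             "but became visible in the 1.5 wt % Cu specimen (Figure 8b). The "
             "precipitates in the 1.5 wt % Cu specimen were tiny, few, and "
             "irregularly scattered at grain boundaries. Cu-rich precipitates were "
             "continuous or semicontinuous at grain boundaries as the Cu dosage was "
             "raised to 2.0 wt %, as shown in Figure 8c.")

# "(a) 1.0 wt %" -> panel letter and the value the caption assigns to it.
CAPTION_PANEL = re.compile(r"\(([a-z])\)\s*(\d+(?:\.\d+)?)\s*wt\s*%")
# "1.5 wt % ... Figure 8a" within one clause -> value and the cited panel.
SENTENCE_REF = re.compile(r"(\d+(?:\.\d+)?)\s*wt\s*%[^.;]*?Figure\s*\d+([a-z])")


def find_mismatches(caption, sentences):
    """Return (panel, caption_value, sentence_value) for each disagreement."""
    by_panel = {panel: float(value) for panel, value in CAPTION_PANEL.findall(caption)}
    mismatches = []
    for value, panel in SENTENCE_REF.findall(sentences):
        if panel in by_panel and by_panel[panel] != float(value):
            mismatches.append((panel, by_panel[panel], float(value)))
    return mismatches


print(find_mismatches(CAPTION, SENTENCES))
```

On this case the check flags panel (a): the caption labels it 1.0 wt % Cu, while the related sentence cites Figure 8a for the 1.5 wt % Cu specimen.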


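Seeking Summer Internships

Tzu-Yen Ma (麻紫嫣)

Beijing University of Posts and Telecommunications (BUPT)

Master's student, class of 2027

Target roles: multimodal reasoning / evaluation / data synthesis / Agent algorithm engineer

maziyan@bupt.edu.cn

Bo Zhang (张博)

Beijing University of Posts and Telecommunications (BUPT)

Master's student, class of 2027

Target roles: multimodal reasoning / evaluation / data synthesis / Agent algorithm engineer

zhangbol@bupt.edu.cn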