RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

Jiahe Song1,2*, Chuang Wang3,2*, Bowen Jiang4,2*, Yinfan Wang2*, Hao Zheng2,5*, Xingjian Wei2, Chengjin Liu6, Rui Nie3,2, Junyuan Gao2, Jiaxing Sun2, Yubin Wang2, Lijun Wu2, Zhenhua Huang5‡, Jiang Wu2‡, Qian Yu3‡, Conghui He2‡
*Equal Contribution, ‡Corresponding Authors
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Beihang University, 4Peking University,
5South China Normal University, 6Northwestern Polytechnical University
Paper status - Accepted by CVPR 2026 🎉🎉🎉
RxnCaption Framework Teaser

RxnCaption reformulates chemical Reaction Diagram Parsing (RxnDP) into an image captioning task, leveraging a "BBox and Index as Visual Prompt" strategy to unlock the natural capabilities of Large Vision-Language Models.

Abstract

Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing reaction data often exist only as images within papers, so they are not machine-readable and cannot be used to train machine learning models. To address this challenge, we propose the RxnCaption framework for chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate-prediction-driven parsing process as an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally.

We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design.
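The pre-drawing step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Box` schema, the function name `build_overlay_svg`, the 0-based indexing, the colors, and the use of an SVG overlay (instead of drawing on the raster image) are all assumptions made so that the example runs with the Python standard library alone. The boxes stand in for detections that a detector such as MolYOLO would supply.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass(frozen=True)
class Box:
    """Axis-aligned molecule box in pixel coordinates (hypothetical schema)."""
    x1: float
    y1: float
    x2: float
    y2: float


def build_overlay_svg(image_href, width, height, boxes):
    """Return SVG markup that layers numbered boxes over the referenced image.

    Indices follow the order of `boxes` and start at 0 (an assumed convention).
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<image href={quoteattr(image_href)} width="{width}" height="{height}"/>',
    ]
    for index, box in enumerate(boxes):
        box_w, box_h = box.x2 - box.x1, box.y2 - box.y1
        # Outline of the detected molecule.
        parts.append(
            f'<rect x="{box.x1}" y="{box.y1}" width="{box_w}" height="{box_h}" '
            'fill="none" stroke="red" stroke-width="2"/>'
        )
        # Index label placed just inside the top-left corner of the box.
        parts.append(
            f'<text x="{box.x1 + 4}" y="{box.y1 + 16}" fill="red" '
            f'font-size="16">{index}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Two made-up detections on a 300x100 diagram.
boxes = [Box(10, 20, 110, 80), Box(150, 20, 250, 80)]
svg = build_overlay_svg("diagram.png", 300, 100, boxes)
```

The resulting markup would be rasterized before being passed to an LVLM, so that the model sees each molecule already outlined and numbered.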

We further construct the U-RxnDiagram-15k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics.

We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.

BROS vs BIVP Strategy

BROS vs BIVP Comparison

Unlike the traditional "Bbox and Role in One Step" (BROS) strategy, which forces LVLMs to predict coordinates, our BIVP strategy uses visual indices to turn parsing into a natural-language description task that aligns with the inherent capabilities of LVLMs.
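Because the model refers to molecules by index under BIVP, coordinates can be recovered afterwards by a simple lookup. The sketch below assumes a hypothetical JSON caption with `reactants`, `conditions`, and `products` lists of 0-based indices; the page does not specify the actual output format, and `resolve_reactions` is an illustrative name.

```python
import json


def resolve_reactions(caption_json, boxes):
    """Replace molecule indices in a caption with the boxes they refer to.

    `caption_json` uses a hypothetical schema:
    {"reactions": [{"reactants": [...], "conditions": [...], "products": [...]}]}
    """
    parsed = json.loads(caption_json)
    resolved = []
    for reaction in parsed["reactions"]:
        entry = {}
        for role in ("reactants", "conditions", "products"):
            # Keep only integer indices that match a detected box.
            entry[role] = [
                boxes[i]
                for i in reaction.get(role, [])
                if isinstance(i, int) and 0 <= i < len(boxes)
            ]
        resolved.append(entry)
    return resolved


# Made-up detections as (x1, y1, x2, y2) tuples, plus a made-up caption.
boxes = [(10, 20, 110, 80), (150, 20, 250, 80), (300, 20, 400, 80)]
caption = '{"reactions": [{"reactants": [0, 1], "products": [2, 7]}]}'
reactions = resolve_reactions(caption, boxes)
```

In this example the out-of-range index 7 is dropped, which shows one way an index-based output could be validated against the detector's boxes without the model ever emitting coordinates.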


The U-RxnDiagram-15k Dataset

Dataset Statistics and Distribution

Collected from real-world papers, the U-RxnDiagram-15k dataset contains 15k images, an order of magnitude more than the RxnScribe dataset. We designed a balanced test set to counter layout bias across single, multi, tree, and cyclic reactions, which makes it a more challenging benchmark for model evaluation.


State-of-the-Art Performance

                              RxnScribe-test                          U-RxnDiagram-15k-test
                              Hybrid Match        Soft Match          Hybrid Match        Soft Match
Model               Strategy  P     R     F1      P     R     F1      P     R     F1      P     R     F1

Trained Model
RxnCaption-VL       BIVP      71.6  72.7  72.2    85.3  87.1  86.2    60.3  59.3  59.8    71.3  69.4  70.4
                    BROS      69.6  68.9  69.2    76.2  76.2  76.2    57.0  57.5  57.2    66.4  67.4  66.9
RxnScribe_w/15k     BROS      72.4  69.1  70.7    84.1  81.7  82.8    61.2  38.7  47.4    72.1  44.7  55.2
RxnScribe_official  BROS      72.3  66.2  69.1    83.8  76.5  80.0    47.4  27.6  34.9    62.1  36.4  45.9
RxnIM               BROS      71.0  70.1  70.5    79.2  74.7  76.9    48.8  30.3  37.4    52.9  32.8  40.5

Open-source Model
Intern-VL3-78B      BIVP      33.8  44.1  38.3    45.8  59.8  51.9    13.0  15.5  14.1    26.6  32.0  29.0
                    BROS      0.0   0.0   0.0     0.0   0.0   0.0     0.1   0.1   0.1     0.3   0.2   0.3
Qwen2.5-VL-7B       BIVP      6.0   4.1   4.9     55.8  36.0  43.8    2.9   0.9   1.4     33.0  10.3  15.6
                    BROS      0.0   0.0   0.0     0.0   0.0   0.0     0.0   0.0   0.0     0.0   0.0   0.0
Qwen2.5-VL-72B      BIVP      51.9  48.5  50.1    70.0  66.3  68.1    30.9  23.8  26.9    52.8  41.6  46.5
                    BROS      2.0   1.4   1.6     15.2  11.2  12.9    0.3   0.3   0.3     4.2   1.9   2.6

Closed-source Model
Gemini-2.5-Pro      BIVP      44.7  56.1  49.8    67.9  86.5  76.1    38.9  42.1  40.4    64.2  69.2  66.6
                    BROS      0.0   0.0   0.0     25.2  23.5  24.3    0.3   0.2   0.3     8.9   4.6   6.0
GPT4o-2024-11-20    BIVP      26.8  33.2  29.6    49.1  58.0  53.2    16.1  16.6  16.3    32.6  32.7  32.6
                    BROS      0.3   0.3   0.3     2.0   1.8   1.9     0.0   0.0   0.0     0.4   0.3   0.3
Qwen-VL-Max         BIVP      50.0  46.9  48.4    71.1  67.6  69.3    34.0  29.0  31.3    55.3  48.4  51.6
                    BROS      0.3   0.3   0.3     7.2   5.9   6.5     0.2   0.1   0.2     3.2   2.6   2.8

RxnCaption-VL with BIVP achieves the highest F1 under both Hybrid Match and Soft Match on the RxnScribe-test and U-RxnDiagram-15k-test benchmarks, outperforming the leading open-source and proprietary models evaluated. Within RxnCaption-VL, BIVP beats the BROS baseline on every reported metric; on RxnScribe-test, Soft Match F1 rises by 10.0 points (86.2 vs. 76.2).
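The 10.0 point figure can be checked against the RxnCaption-VL Soft Match rows for RxnScribe-test. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall, which the page does not state explicitly. Since the table reports rounded values, recomputing F1 from them may differ from a reported F1 in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# RxnCaption-VL on RxnScribe-test, Soft Match (P and R taken from the table).
bivp_f1 = round(f1_score(85.3, 87.1), 1)
bros_f1 = round(f1_score(76.2, 76.2), 1)

# Gap between the two strategies, in F1 points.
gain = round(bivp_f1 - bros_f1, 1)
print(bivp_f1, bros_f1, gain)
```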

BibTeX

@article{song2025rxncaption,
  title={RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning},
  author={Song, Jiahe and Wang, Chuang and Jiang, Bowen and Wang, Yinfan and Zheng, Hao and Wei, Xingjian and Liu, Chengjin and Nie, Rui and Gao, Junyuan and Sun, Jiaxing and others},
  journal={arXiv preprint arXiv:2511.02384},
  year={2025}
}