Center for Super-Intelligent Healthcare Innovation, School of Informatics, Xiamen University


OEHR and OVQA: Building Real-World Benchmarks for Orthopedic Clinical Research and Medical VQA

Published: 2025-12-24

Dataset Overview

The rapid development of data-driven methods in healthcare has created an urgent demand for high-quality, domain-specific benchmark datasets. Orthopedics, despite its clinical importance and strong reliance on imaging and longitudinal patient records, has long lacked publicly available datasets that faithfully reflect real-world clinical practice. To address this gap, we introduce two complementary resources: OEHR, an Orthopedic Electronic Health Record dataset, and OVQA, a clinically generated Orthopedic Visual Question Answering dataset. Together, they form a solid foundation for advancing orthopedic clinical research and multimodal medical AI.

OEHR: An Orthopedic Electronic Health Record Dataset

OEHR is a newly constructed electronic health record dataset specifically designed for the orthopedic domain. It is sourced directly from the EHR systems of real hospitals, ensuring that the data closely mirrors authentic clinical workflows and patient trajectories. The dataset integrates diverse types of clinical information, including structured records, free-text notes, and medical images, making it a comprehensive benchmark for orthopedic research.

At its core, OEHR organizes patient data into a set of well-defined relational tables. The PATIENTS table contains demographic and identification information for individual patients, each uniquely indexed by a subject_id. Hospitalization events are captured in the ADMISSIONS table, where each admission is associated with a unique admission_id, enabling longitudinal analysis across multiple hospital stays.
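As a minimal sketch of how these two tables connect, the snippet below joins demographics onto admissions with pandas and counts stays per patient. The CSV file names and distribution format are assumptions; only the table names and keys (subject_id, admission_id) come from the description above.

```python
import pandas as pd

# Assumed file names; adjust to OEHR's actual release format.
patients = pd.read_csv("PATIENTS.csv")      # one row per patient, keyed by subject_id
admissions = pd.read_csv("ADMISSIONS.csv")  # one row per hospital stay, keyed by admission_id

# Attach demographics to every admission via the shared subject_id key.
stays = admissions.merge(patients, on="subject_id", how="left")

# Longitudinal view: how many hospital stays does each patient have?
stay_counts = stays.groupby("subject_id")["admission_id"].nunique()
print(stay_counts.describe())
```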

Clinical observations and documentation are represented through multiple complementary tables. The NOTES table stores unstructured free-text data such as physician progress notes and discharge summaries, providing rich narrative context for patient care. Quantitative clinical measurements are recorded in the LABEVENTS table, where each laboratory test is identified by a unique labevents_id. Diagnoses assigned during hospital stays are documented in the DIAGNOSES table, with each diagnosis indexed by a diagnosis_id.

Treatment-related information is also a key component of OEHR. The POE (Physician Order Entry) table records prescription and treatment orders issued by clinicians, each linked via a unique poe_id. In addition, OEHR includes an ORTHOPEDIC_IMAGES table, which catalogs orthopedic medical images such as X-rays and CT scans, each identified by a photo_id. This tight integration of structured data, clinical text, and imaging makes OEHR particularly valuable for multimodal learning, clinical decision support, and retrospective outcome analysis.
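To make the multimodal linkage concrete, the sketch below gathers one admission's notes, labs, diagnoses, orders, and image IDs into a single record. Only the table names and their primary keys are documented above; the assumption that each table also carries an admission_id foreign key, and the CSV file names, are hypothetical and should be adjusted to the released schema.

```python
import pandas as pd

# Assumed file names and an assumed admission_id foreign key in each table.
notes = pd.read_csv("NOTES.csv")
labs = pd.read_csv("LABEVENTS.csv")
diagnoses = pd.read_csv("DIAGNOSES.csv")
orders = pd.read_csv("POE.csv")
images = pd.read_csv("ORTHOPEDIC_IMAGES.csv")

def multimodal_record(admission_id: int) -> dict:
    """Collect text, labs, diagnoses, orders, and images for one hospital stay."""
    return {
        "notes": notes[notes["admission_id"] == admission_id],
        "labs": labs[labs["admission_id"] == admission_id],
        "diagnoses": diagnoses[diagnoses["admission_id"] == admission_id],
        "orders": orders[orders["admission_id"] == admission_id],
        # photo_id is the documented image key; paths/pixels come from the release.
        "images": images[images["admission_id"] == admission_id]["photo_id"].tolist(),
    }
```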

Overall, OEHR is intended as a benchmark dataset to support a wide range of orthopedic clinical research tasks, from patient outcome modeling to multimodal representation learning.

OVQA: A Clinically Generated Orthopedic VQA Dataset

Complementing OEHR, we introduce OVQA, a medical visual question answering dataset generated from electronic medical records. To the best of our knowledge, OVQA is the first VQA dataset specifically focused on orthopedics and grounded in real clinical practice. The images in OVQA are taken directly from EMRs, while the question–answer pairs are constructed from questions frequently asked in hospitals, ensuring strong clinical relevance.

OVQA contains 19,020 medical VQA pairs generated from 2,001 medical images collected across 2,212 orthopedic EMRs. The dataset covers six clinically meaningful question types: Abnormality, Condition Presence, Modality, Organ System, Plane, and Other Attributes. Among these, Abnormality-related questions are the most common, accounting for 31% (5,920) of all questions, reflecting the central role of abnormality detection in orthopedic imaging. In contrast, Plane-related questions are the least frequent, comprising only 4% (795) of the dataset.

From an answer format perspective, OVQA includes both open-ended and closed-ended questions. Approximately 33% (6,260) of the questions are open-ended, while 67% (12,760) are closed-ended. Within the closed-ended subset, “yes/no” questions dominate, representing 76% (9,699) of cases. Importantly, the distribution of “yes” and “no” answers is well balanced, with “yes” accounting for 47%, which helps mitigate answer bias in model training and evaluation.
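Given these distributions, per-type reporting is more informative than a single aggregate score, since Abnormality questions dominate the pool and yes/no questions dominate the closed-ended subset. The sketch below assumes a hypothetical annotations file, ovqa_annotations.csv, with question_type, answer, and prediction columns; OVQA's actual release format may differ.

```python
import pandas as pd

# Hypothetical annotation/prediction dump; column names are assumptions.
qa = pd.read_csv("ovqa_annotations.csv")
qa["correct"] = qa["prediction"].str.lower() == qa["answer"].str.lower()

# Accuracy stratified by the six question types (Abnormality, Plane, ...).
print(qa.groupby("question_type")["correct"].mean())

# Sanity check: the yes/no split is roughly 47/53, so majority-class
# guessing should score near chance on this subset.
yesno = qa[qa["answer"].str.lower().isin(["yes", "no"])]
print(yesno["answer"].str.lower().value_counts(normalize=True))
```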

The image set in OVQA spans two major imaging modalities: CT and X-ray, with 1,410 CT images and 591 X-ray images. These images cover multiple body parts commonly encountered in orthopedic practice, including the hand, leg, chest, and head. Hand images are particularly prevalent, making up 50% (999) of the total, reflecting the frequency of hand-related orthopedic examinations.

By combining clinically grounded questions with real medical images, OVQA provides a challenging and realistic benchmark for evaluating medical VQA systems in orthopedics.

Data Access, Ethics, and Usage Requirements

Access to these datasets requires compliance with data governance and ethical research standards. Users must upload a completion report from the CITI “Data or Specimens Only Research” training program (https://physionet.org/about/citi-course/), listing all completed modules along with dates and scores. Completion certificates alone are not sufficient, and expired reports will not be accepted.

Researchers can apply for dataset access through the following link:
Dataset access application form: https://docs.qq.com/form/page/DRVJDS2NIbERxTVdS

These requirements ensure responsible use of sensitive clinical data and promote best practices in medical research.

Citation Information

If you find OEHR or OVQA useful for your research, please cite the corresponding papers:

@inproceedings{xie2024oehr,
  title={OEHR: An Orthopedic Electronic Health Record Dataset},
  author={Xie, Yibo and Wang, Kaifan and Zheng, Jiawei and Liu, Feiyan and Wang, Xiaoli and Huang, Guofeng},
  booktitle={Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={1126--1135},
  year={2024}
}

@inproceedings{huang2022ovqa,
  title={OVQA: A Clinically Generated Visual Question Answering Dataset},
  author={Huang, Yefan and Wang, Xiaoli and Liu, Feiyan and Huang, Guofeng},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={2924--2938},
  year={2022}
}

Together, OEHR and OVQA represent a significant step toward realistic, high-quality benchmarks for orthopedic clinical research and multimodal medical AI, bridging the gap between real-world clinical data and advanced machine learning methodologies.


Dataset Keywords:
OEHR, OVQA