1. We introduce GEMeX, a large-scale Med-VQA dataset for chest X-rays that supports diverse question types and provides enhanced explainability for medical VQA systems. To our knowledge, it is the largest chest X-ray VQA dataset and the first Med-VQA dataset to embody the concept of multimodal explainability (a hypothetical record sketch follows this list).
2. We systematically benchmark 10 representative LVLMs on GEMeX, introducing multiple evaluation metrics to comprehensively characterize how current popular LVLMs perform on the Med-VQA task (a scoring sketch follows this list).
3. We show that fine-tuning on our precise visual and textual explanations notably enhances the visual reasoning ability of LVLMs, addressing a key deficiency observed across models. These results highlight the importance of a large-scale, groundable, and explainable VQA benchmark for advancing the development and deployment of LVLMs in healthcare.
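To make multimodal explainability concrete, here is a minimal sketch of what a single GEMeX-style example could look like: an answer paired with both a textual explanation and a grounded image region. The field names, file name, and the `[x_min, y_min, x_max, y_max]` box convention are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical GEMeX-style VQA record. Field names and the box format
# [x_min, y_min, x_max, y_max] are illustrative assumptions, not the
# dataset's actual schema.
sample = {
    "image": "chest_xray_00017.png",
    "question": "Is there evidence of pleural effusion?",
    "question_type": "closed-ended",
    "answer": "Yes",
    # Textual explanation: the reasoning behind the answer.
    "text_explanation": "Blunting of the left costophrenic angle "
                        "suggests fluid accumulation.",
    # Visual explanation: the image region grounding the answer.
    "visual_explanation": [412, 880, 768, 1024],
}
```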
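For item 2, the sketch below illustrates how a groundable VQA benchmark might be scored on two complementary axes: exact-match answer accuracy and mean IoU between predicted and reference boxes. This is not the benchmark's official scoring code; the `(answer, box)` prediction format and the function names are assumptions made for the example.

```python
from typing import Sequence, Tuple

Box = Sequence[float]  # assumed format: [x_min, y_min, x_max, y_max]

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def score(preds: Sequence[Tuple[str, Box]],
          refs: Sequence[Tuple[str, Box]]) -> dict:
    """Exact-match answer accuracy plus mean grounding IoU.
    The (answer, box) pair format is an illustrative assumption."""
    acc = sum(p[0].strip().lower() == r[0].strip().lower()
              for p, r in zip(preds, refs)) / len(refs)
    miou = sum(iou(p[1], r[1]) for p, r in zip(preds, refs)) / len(refs)
    return {"answer_accuracy": acc, "mean_grounding_iou": miou}

if __name__ == "__main__":
    refs = [("Yes", [412, 880, 768, 1024])]
    preds = [("yes", [400, 860, 760, 1010])]
    print(score(preds, refs))
```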
BibTeX coming soon.