MedConceptsQA

A Medical Concepts QA Dataset for LLM Evaluation

Procedures QA: 484,334 questions
Medications QA: 18,818 questions
Diagnoses QA: 316,680 questions

MedConceptsQA is a dedicated open-source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluated various Large Language Models on the benchmark. It serves as a valuable resource for assessing the ability of Large Language Models to interpret medical codes and to distinguish between medical concepts. Our findings showed that most state-of-the-art clinical LLMs, despite being pre-trained on medical data, achieved accuracy levels close to random guessing on this benchmark. However, general-purpose models such as Llama3-70B and GPT-4 outperformed the clinical LLMs. Notably, GPT-4 achieved the best performance, although its accuracy was still insufficient on some of the benchmark's datasets.
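The benchmark can be loaded with the Hugging Face `datasets` library. The snippet below is a minimal sketch; the repository ID and config name shown are assumptions and may differ from the published dataset.

```python
# Minimal sketch: load one vocabulary/difficulty configuration and inspect an example.
# The repository ID and config name are assumptions, not the confirmed dataset schema.
from datasets import load_dataset

ds = load_dataset("ofir408/MedConceptsQA", "icd10cm_diagnosis_easy")
print(ds)                        # available splits and their fields
first_split = next(iter(ds.values()))
print(first_split[0])            # one multiple-choice question with its options and answer
```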

Overview


MedConceptsQA Benchmark Results


We evaluated different models on the MedConceptsQA benchmark using our evaluation code, which is available here. If you wish to submit your model for evaluation, please open a GitHub issue here with your model's Hugging Face name.
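Conceptually, the evaluation compares each model's predicted answer letter with the gold answer for every multiple-choice question. The sketch below is only an illustration of such a loop, assuming a generic `model_generate(prompt)` callable and illustrative field names; it is not the benchmark's actual evaluation code.

```python
# Minimal zero-shot evaluation sketch. `model_generate` stands in for any LLM call
# (local model or API); field names are illustrative and may not match the exact schema.
def build_prompt(question: str, options: dict[str, str]) -> str:
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

def evaluate(model_generate, examples) -> float:
    correct = 0
    for ex in examples:
        prompt = build_prompt(ex["question"], ex["options"])
        reply = model_generate(prompt).strip().upper()
        predicted_letter = reply[:1]           # take the first character of the reply
        correct += predicted_letter == ex["answer"]
    return correct / len(examples)             # accuracy in [0, 1]
```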

Zero-shot Learning Results:
| Model Name | Accuracy (%) | CI |
|---|---|---|
| gpt-4-0125-preview | 52.489 | 2.064 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 48.471 | 2.065 |
| m42-health/Llama3-Med42-70B | 47.093 | 2.062 |
| meta-llama/Meta-Llama-3-70B-Instruct | 47.076 | 2.062 |
| aaditya/Llama3-OpenBioLLM-70B | 41.849 | 2.039 |
| HPAI-BSC/Llama3.1-Aloe-Beta-8B | 38.462 | 2.010 |
| gpt-3.5-turbo | 37.058 | 1.996 |
| meta-llama/Meta-Llama-3-8B-Instruct | 34.8 | 1.968 |
| aaditya/Llama3-OpenBioLLM-8B | 29.431 | 1.883 |
| johnsnowlabs/JSL-MedMNX-7B | 28.649 | 1.868 |
| epfl-llm/meditron-70b | 28.133 | 1.858 |
| dmis-lab/meerkat-7b-v1.0 | 27.982 | 1.855 |
| BioMistral/BioMistral-7B-DARE | 26.836 | 1.831 |
| epfl-llm/meditron-7b | 26.107 | 1.814 |
| HPAI-BSC/Llama3.1-Aloe-Beta-70B | 25.929 | 1.811 |
| dmis-lab/biobert-v1.1 | 25.636 | 1.804 |
| UFNLP/gatortron-large | 25.298 | 1.796 |
| PharMolix/BioMedGPT-LM-7B | 24.924 | 1.787 |
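The few-shot results below differ from the zero-shot setting only in that a few solved examples are prepended to each prompt. The sketch below shows one hypothetical way such a prompt might be assembled; the field names and prompt format are illustrative, not the benchmark's exact implementation.

```python
# Minimal few-shot prompt sketch: prepend k solved examples (question, options,
# and gold answer) before the target question. Field names are illustrative.
def format_question(question, options):
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    return "\n".join(lines)

def build_few_shot_prompt(shots, target, k=4):
    parts = []
    for shot in shots[:k]:
        parts.append(format_question(shot["question"], shot["options"]))
        parts.append(f"Answer: {shot['answer']}\n")
    parts.append(format_question(target["question"], target["options"]))
    parts.append("Answer:")
    return "\n".join(parts)
```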
Few-shot Learning Results:
| Model Name | Accuracy (%) | CI |
|---|---|---|
| gpt-4-0125-preview | 61.911 | 3.475 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 58.720 | 3.523 |
| HPAI-BSC/Llama3.1-Aloe-Beta-70B | 58.142 | 3.530 |
| meta-llama/Meta-Llama-3-70B-Instruct | 57.867 | 3.534 |
| m42-health/Llama3-Med42-70B | 56.551 | 3.547 |
| aaditya/Llama3-OpenBioLLM-70B | 53.387 | 3.570 |
| HPAI-BSC/Llama3.1-Aloe-Beta-8B | 41.671 | 3.528 |
| gpt-3.5-turbo | 41.476 | 3.526 |
| meta-llama/Meta-Llama-3-8B-Instruct | 40.693 | 3.516 |
| aaditya/Llama3-OpenBioLLM-8B | 35.316 | 3.421 |
| epfl-llm/meditron-70b | 34.809 | 3.409 |
| johnsnowlabs/JSL-MedMNX-7B | 32.436 | 3.350 |
| BioMistral/BioMistral-7B-DARE | 28.702 | 3.237 |
| PharMolix/BioMedGPT-LM-7B | 28.204 | 3.220 |
| dmis-lab/meerkat-7b-v1.0 | 28.187 | 3.219 |
| epfl-llm/meditron-7b | 26.231 | 3.148 |
| dmis-lab/biobert-v1.1 | 25.982 | 3.138 |
| UFNLP/gatortron-large | 25.093 | 3.102 |
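
In both tables, CI is presumably the half-width of a confidence interval around the reported accuracy. The sketch below shows one generic way a 95% interval can be computed with a normal approximation; the benchmark's evaluation code may use a different method, and the numbers are only an arithmetic example.

```python
import math

def accuracy_with_ci(num_correct: int, num_total: int, z: float = 1.96):
    # Accuracy (%) and 95% CI half-width (%) via the normal approximation.
    # Generic illustration only; not necessarily the method behind the tables above.
    p = num_correct / num_total
    half_width = z * math.sqrt(p * (1 - p) / num_total)
    return 100 * p, 100 * half_width

acc, ci = accuracy_with_ci(525, 1000)   # e.g., 525 correct answers out of 1,000 questions
print(f"accuracy = {acc:.3f}%  (±{ci:.3f})")
```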

Citation

@article{SHOHAM2024109089,
    title = {MedConceptsQA: Open source medical concepts QA benchmark},
    journal = {Computers in Biology and Medicine},
    volume = {182},
    pages = {109089},
    year = {2024},
    issn = {0010-4825},
    doi = {10.1016/j.compbiomed.2024.109089},
    url = {https://www.sciencedirect.com/science/article/pii/S0010482524011740},
    author = {Ofir Ben Shoham and Nadav Rappoport}
}