MedConceptsQA

A Medical Concepts QA Dataset for LLM Evaluation

Procedures QA: 484,334 questions
Medications QA: 18,818 questions
Diagnoses QA: 316,680 questions

MedConceptsQA is a dedicated open-source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluated various Large Language Models on the benchmark. It serves as a valuable resource for assessing the ability of Large Language Models to interpret medical codes and to distinguish between medical concepts. Our findings showed that most state-of-the-art clinical LLMs, despite being pre-trained on medical data, achieved accuracy levels close to random guessing on this benchmark. However, general-purpose models such as Llama3-70B and GPT-4 outperformed the clinical LLMs. Notably, GPT-4 achieved the best performance, although its accuracy was still insufficient on some of the benchmark's datasets.
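The benchmark can be loaded with the Hugging Face `datasets` library. The snippet below is a minimal sketch; the repository ID and config name shown are assumptions and may differ from the published dataset.

```python
# Minimal sketch: load one vocabulary/difficulty configuration and inspect an example.
# The repository ID and config name are assumptions, not the confirmed dataset schema.
from datasets import load_dataset

ds = load_dataset("ofir408/MedConceptsQA", "icd10cm_diagnosis_easy")
print(ds)                        # available splits and their fields
first_split = next(iter(ds.values()))
print(first_split[0])            # one multiple-choice question with its options and answer
```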

Overview


MedConceptsQA Benchmark Results


We evaluated different models on the MedConceptsQA benchmark using our evaluation code, which is available here. If you wish to submit your model for evaluation, please open a GitHub issue here with your model's Hugging Face name.
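Conceptually, the evaluation compares each model's predicted answer letter with the gold answer for every multiple-choice question. The sketch below is only an illustration of such a loop, assuming a generic `model_generate(prompt)` callable and illustrative field names; it is not the benchmark's actual evaluation code.

```python
# Minimal zero-shot evaluation sketch. `model_generate` stands in for any LLM call
# (local model or API); field names are illustrative and may not match the exact schema.
def build_prompt(question: str, options: dict[str, str]) -> str:
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

def evaluate(model_generate, examples) -> float:
    correct = 0
    for ex in examples:
        prompt = build_prompt(ex["question"], ex["options"])
        reply = model_generate(prompt).strip().upper()
        predicted_letter = reply[:1]           # take the first character of the reply
        correct += predicted_letter == ex["answer"]
    return correct / len(examples)             # accuracy in [0, 1]
```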

Zero-shot Learning Results:
| Model Name | Accuracy (%) | CI |
|---|---|---|
| gpt-4-0125-preview | 52.489 | 2.064 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 48.471 | 2.065 |
| m42-health/Llama3-Med42-70B | 47.093 | 2.062 |
| meta-llama/Meta-Llama-3-70B-Instruct | 47.076 | 2.062 |
| aaditya/Llama3-OpenBioLLM-70B | 41.849 | 2.039 |
| HPAI-BSC/Llama3.1-Aloe-Beta-8B | 38.462 | 2.010 |
| gpt-3.5-turbo | 37.058 | 1.996 |
| meta-llama/Meta-Llama-3-8B-Instruct | 34.8 | 1.968 |
| aaditya/Llama3-OpenBioLLM-8B | 29.431 | 1.883 |
| johnsnowlabs/JSL-MedMNX-7B | 28.649 | 1.868 |
| epfl-llm/meditron-70b | 28.133 | 1.858 |
| dmis-lab/meerkat-7b-v1.0 | 27.982 | 1.855 |
| BioMistral/BioMistral-7B-DARE | 26.836 | 1.831 |
| epfl-llm/meditron-7b | 26.107 | 1.814 |
| HPAI-BSC/Llama3.1-Aloe-Beta-70B | 25.929 | 1.811 |
| dmis-lab/biobert-v1.1 | 25.636 | 1.804 |
| UFNLP/gatortron-large | 25.298 | 1.796 |
| PharMolix/BioMedGPT-LM-7B | 24.924 | 1.787 |
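The few-shot results below differ from the zero-shot setting only in that a few solved examples are prepended to each prompt. The sketch below shows one hypothetical way such a prompt might be assembled; the field names and prompt format are illustrative, not the benchmark's exact implementation.

```python
# Minimal few-shot prompt sketch: prepend k solved examples (question, options,
# and gold answer) before the target question. Field names are illustrative.
def format_question(question, options):
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    return "\n".join(lines)

def build_few_shot_prompt(shots, target, k=4):
    parts = []
    for shot in shots[:k]:
        parts.append(format_question(shot["question"], shot["options"]))
        parts.append(f"Answer: {shot['answer']}\n")
    parts.append(format_question(target["question"], target["options"]))
    parts.append("Answer:")
    return "\n".join(parts)
```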
Few-shot Learning Results:
| Model Name | Accuracy (%) | CI |
|---|---|---|
| gpt-4-0125-preview | 61.911 | 3.475 |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 58.720 | 3.523 |
| HPAI-BSC/Llama3.1-Aloe-Beta-70B | 58.142 | 3.530 |
| meta-llama/Meta-Llama-3-70B-Instruct | 57.867 | 3.534 |
| m42-health/Llama3-Med42-70B | 56.551 | 3.547 |
| aaditya/Llama3-OpenBioLLM-70B | 53.387 | 3.570 |
| HPAI-BSC/Llama3.1-Aloe-Beta-8B | 41.671 | 3.528 |
| gpt-3.5-turbo | 41.476 | 3.526 |
| meta-llama/Meta-Llama-3-8B-Instruct | 40.693 | 3.516 |
| aaditya/Llama3-OpenBioLLM-8B | 35.316 | 3.421 |
| epfl-llm/meditron-70b | 34.809 | 3.409 |
| johnsnowlabs/JSL-MedMNX-7B | 32.436 | 3.350 |
| BioMistral/BioMistral-7B-DARE | 28.702 | 3.237 |
| PharMolix/BioMedGPT-LM-7B | 28.204 | 3.220 |
| dmis-lab/meerkat-7b-v1.0 | 28.187 | 3.219 |
| epfl-llm/meditron-7b | 26.231 | 3.148 |
| dmis-lab/biobert-v1.1 | 25.982 | 3.138 |
| UFNLP/gatortron-large | 25.093 | 3.102 |
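
In both tables, CI is presumably the half-width of a confidence interval around the reported accuracy. The sketch below shows one generic way a 95% interval can be computed with a normal approximation; the benchmark's evaluation code may use a different method, and the numbers are only an arithmetic example.

```python
import math

def accuracy_with_ci(num_correct: int, num_total: int, z: float = 1.96):
    # Accuracy (%) and 95% CI half-width (%) via the normal approximation.
    # Generic illustration only; not necessarily the method behind the tables above.
    p = num_correct / num_total
    half_width = z * math.sqrt(p * (1 - p) / num_total)
    return 100 * p, 100 * half_width

acc, ci = accuracy_with_ci(525, 1000)   # e.g., 525 correct answers out of 1,000 questions
print(f"accuracy = {acc:.3f}%  (±{ci:.3f})")
```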

Citation

@article{SHOHAM2024109089,
    title = {MedConceptsQA: Open source medical concepts QA benchmark},
    journal = {Computers in Biology and Medicine},
    volume = {182},
    pages = {109089},
    year = {2024},
    issn = {0010-4825},
    doi = {10.1016/j.compbiomed.2024.109089},
    url = {https://www.sciencedirect.com/science/article/pii/S0010482524011740},
    author = {Ofir Ben Shoham and Nadav Rappoport}
}