Healthcare ML Use Cases & Datasets

Curated list of healthcare/biomedical use cases with publicly available datasets.


Implemented

1. Drug-Drug Interaction (DDI) Classification

  • Dataset: DrugBank (bundled)
  • Task: Classify interaction severity
  • Size: 176K samples
  • Labels: Minor, Moderate, Major, Contraindicated
  • Status: Production ready

2. Adverse Drug Event Detection

  • Dataset: ade-benchmark-corpus/ade_corpus_v2
  • Task: Binary classification for ADE presence
  • Size: 30K samples
  • Labels: ADE / No ADE
  • Status: Production ready

3. Symptom-to-Disease Prediction

  • Dataset: shanover/disease_symptoms_prec_full
  • Task: Predict disease from symptoms
  • Size: ~5K samples
  • Labels: 41 disease categories
  • Status: Production ready

4. Medical Triage Classification

  • Dataset: shubham212/Medical_Triage_Classification
  • Task: Classify urgency level
  • Size: 5K samples
  • Labels: Emergency, Urgent, Standard, Non-urgent
  • Status: Production ready (needs more training data)
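The label sets listed for the implemented use cases can be mapped to integer class ids before training. The following is a minimal sketch, not the pipeline's actual encoding: the index order is an assumption (taken from the order in which labels appear in this document, with "ADE" placed at index 1 as the positive class), and `LABELS`, `encode`, and `decode` are hypothetical names. The symptom-to-disease task is omitted because its 41 categories are not listed here.

```python
# Label sets as listed in this document; the index order is an assumption.
LABELS = {
    "ddi": ["Minor", "Moderate", "Major", "Contraindicated"],
    "ade": ["No ADE", "ADE"],
    "triage": ["Emergency", "Urgent", "Standard", "Non-urgent"],
}


def encode(task: str, label: str) -> int:
    """Return the integer class id for a label of the given task."""
    try:
        return LABELS[task].index(label)
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    except ValueError:
        raise ValueError(f"unknown label {label!r} for task {task!r}") from None


def decode(task: str, class_id: int) -> str:
    """Return the label string for an integer class id of the given task."""
    return LABELS[task][class_id]
```

Keeping the mapping in one place means training and inference code read the same class order, which avoids silently mismatched label ids.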

Future Candidates

PubMed Multi-Label Classification (MeSH)

  • Dataset: owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH
  • Task: Assign MeSH subject headings to articles
  • Size: 50K articles
  • Use Case: Literature categorization

MedMCQA - Medical Exam QA

  • Dataset: openlifescienceai/medmcqa
  • Task: Answer medical entrance exam questions
  • Size: 194K MCQs
  • Use Case: Medical education, knowledge testing

PubMedQA - Research Question Answering

  • Dataset: qiaojin/PubMedQA
  • Task: Answer research questions with Yes/No/Maybe based on abstracts
  • Size: 274K samples
  • Use Case: Evidence-based medicine

Medical Abbreviation Disambiguation

  • Dataset: McGill-NLP/medal
  • Task: Disambiguate abbreviations in context
  • Size: 4GB curated
  • Use Case: Clinical note processing

BioInstruct

  • Dataset: bio-nlp-umass/bioinstruct
  • Task: Instruction following across biomedical tasks
  • Size: 25K instructions
  • Use Case: General biomedical assistant

Dataset Comparison

Dataset           Size   Task         Complexity
DDI (DrugBank)    176K   4-class      Medium
ADE Corpus        30K    Binary       Low
PubMed MeSH       50K    Multi-label  High
MedMCQA           194K   MCQ          High
PubMedQA          274K   3-class      Medium
Symptom-Disease   5K     41-class     Medium
Triage            5K     4-class      Low
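
The comparison above can also be kept as structured data so that candidates can be filtered or ranked in code. This is a rough sketch under stated assumptions: `DatasetInfo`, `DATASETS`, `by_complexity`, and `largest_first` are hypothetical names, the rows are copied from the table, and the sizes are the approximate counts given there rather than exact sample counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    size: int        # approximate number of samples, as listed in the table
    task: str
    complexity: str  # "Low", "Medium", or "High"


# Rows copied from the comparison table; sizes are approximate.
DATASETS = [
    DatasetInfo("DDI (DrugBank)", 176_000, "4-class", "Medium"),
    DatasetInfo("ADE Corpus", 30_000, "Binary", "Low"),
    DatasetInfo("PubMed MeSH", 50_000, "Multi-label", "High"),
    DatasetInfo("MedMCQA", 194_000, "MCQ", "High"),
    DatasetInfo("PubMedQA", 274_000, "3-class", "Medium"),
    DatasetInfo("Symptom-Disease", 5_000, "41-class", "Medium"),
    DatasetInfo("Triage", 5_000, "4-class", "Low"),
]


def by_complexity(level: str) -> list[str]:
    """Names of datasets with the given complexity, in table order."""
    return [d.name for d in DATASETS if d.complexity == level]


def largest_first() -> list[DatasetInfo]:
    """Datasets sorted by approximate size, largest first."""
    return sorted(DATASETS, key=lambda d: d.size, reverse=True)
```

For example, `by_complexity("Low")` selects the two datasets that are likely the quickest to use for smoke-testing a pipeline run.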

Additional Resources

  • MIMIC-III/IV: ICU clinical data (PhysioNet access required)
  • n2c2 Challenges: Clinical NLP shared tasks
  • i2b2: De-identified clinical records
  • ChemProt: Chemical-protein interactions
  • BC5CDR: Chemical-disease relations