ghndrx/kubeflow-pipelines

Fork 0

mirror of https://github.com/ghndrx/kubeflow-pipelines.git synced 2026-02-10 06:45:13 +00:00

Files

Greg Hendrickson f8a0e00a7f docs: add healthcare ML use cases and datasets roadmap

2026-02-03 16:17:33 +00:00

4.7 KiB

Raw Blame History

Healthcare ML Use Cases & Datasets

Curated list of similar healthcare/biomedical use cases with publicly available datasets for training on RunPod.

🔥 Priority 1: Ready to Train

1. Adverse Drug Event Classification

Dataset: Lots-of-LoRAs/task1495_adverse_drug_event_classification

Task: Classify text for presence of adverse drug events
Size: ~10K samples
Labels: Binary (adverse event / no adverse event)
Use Case: Pharmacovigilance, FDA reporting automation
Model: Bio_ClinicalBERT

from datasets import load_dataset
ds = load_dataset("Lots-of-LoRAs/task1495_adverse_drug_event_classification")

2. PubMed Multi-Label Classification (MeSH)

Dataset: owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH

Task: Assign MeSH medical subject headings to research articles
Size: ~50K articles
Labels: Multi-label (medical topics)
Use Case: Literature categorization, research discovery
Model: PubMedBERT

from datasets import load_dataset
ds = load_dataset("owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH")

3. Symptom-to-Disease Prediction

Dataset: shanover/disease_symptoms_prec_full

Task: Predict disease from symptom descriptions
Size: Variable
Labels: Disease categories
Use Case: Triage, symptom checker apps
Model: Bio_ClinicalBERT

from datasets import load_dataset
ds = load_dataset("shanover/disease_symptoms_prec_full")

4. Medical Triage Classification

Dataset: shubham212/Medical_Triage_Classification

Task: Classify urgency level of medical cases
Size: ~500 downloads (popular)
Labels: Triage levels (Emergency, Urgent, Standard)
Use Case: ER automation, telemedicine routing
Model: Bio_ClinicalBERT

📚 Priority 2: QA & Reasoning

5. MedMCQA - Medical Exam Questions

Dataset: openlifescienceai/medmcqa (24K downloads!)

Task: Answer medical entrance exam questions
Size: 194K MCQs covering 2.4K healthcare topics
Labels: Multiple choice (A/B/C/D)
Use Case: Medical education, knowledge testing
Model: Llama-3 or Gemma (LLM fine-tuning)

from datasets import load_dataset
ds = load_dataset("openlifescienceai/medmcqa")

6. PubMedQA - Research Question Answering

Dataset: qiaojin/PubMedQA (18K downloads!)

Task: Answer yes/no/maybe questions from abstracts
Size: 274K samples
Labels: yes / no / maybe
Use Case: Evidence-based medicine, literature review
Model: PubMedBERT or Bio_ClinicalBERT

from datasets import load_dataset
ds = load_dataset("qiaojin/PubMedQA")

🧬 Priority 3: Specialized NLP

7. Medical Abbreviation Disambiguation (MeDAL)

Dataset: McGill-NLP/medal

Task: Disambiguate medical abbreviations in context
Size: 14GB → curated to 4GB
Labels: Abbreviation meanings
Use Case: Clinical note processing, EHR parsing
Model: Bio_ClinicalBERT

8. BioInstruct - Instruction Following

Dataset: bio-nlp-umass/bioinstruct

Task: Instruction-tuned biomedical tasks
Size: 25K instructions
Labels: Various biomedical tasks
Use Case: General biomedical assistant
Model: Llama-3 or Mistral (LoRA fine-tuning)

🛠️ Implementation Roadmap

Week 1: Adverse Drug Events

Download ADE dataset
Add to handler.py as new training mode
Train classifier → S3
Build inference endpoint

Week 2: PubMed Classification

Download PubMed MeSH dataset
Multi-label classification head
Train → S3
Literature search API

Week 3: Medical QA

Download MedMCQA
LLM fine-tuning with LoRA
Deploy QA endpoint

Week 4: Symptom Checker

Symptom-disease dataset
Train classifier
Build symptom input → disease prediction API

📊 Dataset Comparison

Dataset	Size	Task	Difficulty	Business Value
DDI (current)	176K	Classification	Medium	⭐⭐⭐⭐⭐
Adverse Events	10K	Binary	Easy	⭐⭐⭐⭐⭐
PubMed MeSH	50K	Multi-label	Medium	⭐⭐⭐⭐
MedMCQA	194K	MCQ	Hard	⭐⭐⭐⭐
PubMedQA	274K	Yes/No/Maybe	Medium	⭐⭐⭐⭐
Symptom→Disease	Varies	Classification	Easy	⭐⭐⭐⭐⭐
Triage	~5K	Classification	Easy	⭐⭐⭐⭐⭐

🔗 Additional Resources

MIMIC-III/IV: ICU clinical data (requires PhysioNet access)
n2c2 Challenges: Clinical NLP shared tasks
i2b2: De-identified clinical records
ChemProt: Chemical-protein interactions
BC5CDR: Chemical-disease relations

Generated: 2026-02-03

4.7 KiB Raw Blame History