mirror of
https://github.com/ghndrx/kubeflow-pipelines.git
synced 2026-02-10 06:45:13 +00:00
Healthcare ML Use Cases & Datasets
Curated list of healthcare and biomedical use cases, similar to the current DDI task, with publicly available datasets for training on RunPod.
🔥 Priority 1: Ready to Train
1. Adverse Drug Event Classification
Dataset: Lots-of-LoRAs/task1495_adverse_drug_event_classification
- Task: Classify text for presence of adverse drug events
- Size: ~10K samples
- Labels: Binary (adverse event / no adverse event)
- Use Case: Pharmacovigilance, FDA reporting automation
- Model: Bio_ClinicalBERT
from datasets import load_dataset
ds = load_dataset("Lots-of-LoRAs/task1495_adverse_drug_event_classification")
2. PubMed Multi-Label Classification (MeSH)
Dataset: owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH
- Task: Assign MeSH medical subject headings to research articles
- Size: ~50K articles
- Labels: Multi-label (medical topics)
- Use Case: Literature categorization, research discovery
- Model: PubMedBERT
from datasets import load_dataset
ds = load_dataset("owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH")
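Multi-label targets are usually encoded as multi-hot vectors before training. The following is a minimal sketch of that encoding, assuming a fixed label vocabulary; the label names in `VOCAB` are hypothetical placeholders and not necessarily the dataset's actual columns.

```python
from typing import Iterable, List, Sequence


def encode_multi_hot(labels: Iterable[str], vocabulary: Sequence[str]) -> List[float]:
    """Return a multi-hot vector with 1.0 at the index of every label present."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for label in labels:
        if label not in index:
            raise KeyError(f"unknown label: {label!r}")
        vector[index[label]] = 1.0
    return vector


# Hypothetical vocabulary, for illustration only.
VOCAB = ["Anatomy", "Organisms", "Diseases", "Chemicals and Drugs"]
example = encode_multi_hot(["Diseases", "Chemicals and Drugs"], VOCAB)
```

Float values are used because multi-label losses such as binary cross-entropy generally expect float targets.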
3. Symptom-to-Disease Prediction
Dataset: shanover/disease_symptoms_prec_full
- Task: Predict disease from symptom descriptions
- Size: Variable
- Labels: Disease categories
- Use Case: Triage, symptom checker apps
- Model: Bio_ClinicalBERT
from datasets import load_dataset
ds = load_dataset("shanover/disease_symptoms_prec_full")
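A classifier needs integer class ids rather than disease names. Below is a small sketch of building the two lookup tables from whatever label strings the dataset contains; the disease names shown are hypothetical, and the helper `build_label_maps` is not part of any library.

```python
from typing import Dict, Iterable, Tuple


def build_label_maps(labels: Iterable[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each distinct label to a stable integer id (alphabetical order)."""
    names = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


# Hypothetical disease names, for illustration only.
label2id, id2label = build_label_maps(["Migraine", "Asthma", "Migraine", "Eczema"])
```

Sorting the names keeps the ids stable across runs, which matters when a trained model is later reloaded for inference.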
4. Medical Triage Classification
Dataset: shubham212/Medical_Triage_Classification
- Task: Classify urgency level of medical cases
- Size: ~5K samples per the comparison table below (~500 downloads on the Hub)
- Labels: Triage levels (Emergency, Urgent, Standard)
- Use Case: ER automation, telemedicine routing
- Model: Bio_ClinicalBERT
📚 Priority 2: QA & Reasoning
5. MedMCQA - Medical Exam Questions
Dataset: openlifescienceai/medmcqa (24K downloads!)
- Task: Answer medical entrance exam questions
- Size: 194K MCQs covering 2.4K healthcare topics
- Labels: Multiple choice (A/B/C/D)
- Use Case: Medical education, knowledge testing
- Model: Llama-3 or Gemma (LLM fine-tuning)
from datasets import load_dataset
ds = load_dataset("openlifescienceai/medmcqa")
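For LLM fine-tuning, each question has to be rendered as a prompt/completion pair. The sketch below shows one possible format. It assumes the rows expose `question`, `opa` to `opd`, and a 0-based `cop` answer index; verify those column names against the dataset before relying on them. The sample row is invented for illustration.

```python
from typing import Any, Dict

OPTION_KEYS = ("opa", "opb", "opc", "opd")  # assumed option column names
LETTERS = "ABCD"


def format_mcq(row: Dict[str, Any]) -> Dict[str, str]:
    """Turn one MCQ row into a prompt/completion pair for supervised fine-tuning."""
    lines = [f"Question: {row['question']}"]
    for letter, key in zip(LETTERS, OPTION_KEYS):
        lines.append(f"{letter}. {row[key]}")
    lines.append("Answer:")
    # `cop` is assumed to be the 0-based index of the correct option.
    return {"prompt": "\n".join(lines), "completion": " " + LETTERS[row["cop"]]}


# Hypothetical row, not taken from the dataset.
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
    "cop": 2,
}
pair = format_mcq(sample)
```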
6. PubMedQA - Research Question Answering
Dataset: qiaojin/PubMedQA (18K downloads!)
- Task: Answer yes/no/maybe questions from abstracts
- Size: ~274K samples in total (1K expert-labeled, 61.2K unlabeled, 211.3K artificially generated)
- Labels: yes / no / maybe
- Use Case: Evidence-based medicine, literature review
- Model: PubMedBERT or Bio_ClinicalBERT
from datasets import load_dataset
# A config name is required: "pqa_labeled", "pqa_unlabeled", or "pqa_artificial".
ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled")
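For a BERT-style classifier, each example can be reduced to a question, the abstract text, and a three-way label. The sketch below assumes each row carries `question`, a `context` dict holding a `contexts` list of abstract passages, and a `final_decision` string; treat those field names as assumptions to check against the dataset. The row shown is invented.

```python
from typing import Any, Dict, Tuple

DECISION_TO_ID = {"yes": 0, "no": 1, "maybe": 2}


def to_classifier_example(row: Dict[str, Any]) -> Tuple[str, str, int]:
    """Return (question, abstract_text, label_id) for a sequence-pair classifier."""
    abstract = " ".join(row["context"]["contexts"])
    return row["question"], abstract, DECISION_TO_ID[row["final_decision"]]


# Hypothetical row, for illustration only.
row = {
    "question": "Does drug X reduce blood pressure?",
    "context": {"contexts": ["We enrolled 120 patients.", "Pressure fell in the treated arm."]},
    "final_decision": "yes",
}
question, abstract, label_id = to_classifier_example(row)
```

The question and abstract would then be passed to the tokenizer as a text pair, so the model sees both segments.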
🧬 Priority 3: Specialized NLP
7. Medical Abbreviation Disambiguation (MeDAL)
Dataset: McGill-NLP/medal
- Task: Disambiguate medical abbreviations in context
- Size: 14GB → curated to 4GB
- Labels: Abbreviation meanings
- Use Case: Clinical note processing, EHR parsing
- Model: Bio_ClinicalBERT
8. BioInstruct - Instruction Following
Dataset: bio-nlp-umass/bioinstruct
- Task: Instruction-tuned biomedical tasks
- Size: 25K instructions
- Labels: Various biomedical tasks
- Use Case: General biomedical assistant
- Model: Llama-3 or Mistral (LoRA fine-tuning)
🛠️ Implementation Roadmap
Week 1: Adverse Drug Events
- Download ADE dataset
- Add to handler.py as new training mode
- Train classifier → S3
- Build inference endpoint
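Adding a training mode to handler.py could look roughly like the dispatch sketch below. Everything here is an assumption about a file this document does not show: the `mode` input key, the mode names `ddi` and `ade`, and the trainer functions are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict


def train_ddi(params: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for the existing training routine (hypothetical)."""
    return {"status": "ok", "mode": "ddi"}


def train_ade(params: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for the new adverse drug event routine (hypothetical)."""
    return {"status": "ok", "mode": "ade"}


# Registry of training modes; adding a use case means adding one entry.
TRAINERS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "ddi": train_ddi,
    "ade": train_ade,
}


def handler(job: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a job to the trainer named in its input payload."""
    job_input = job.get("input", {})
    mode = job_input.get("mode", "ddi")  # default to the existing mode
    trainer = TRAINERS.get(mode)
    if trainer is None:
        return {"error": f"unknown mode: {mode!r}"}
    return trainer(job_input)
```

With the RunPod serverless SDK, a function of this shape is typically registered through `runpod.serverless.start({"handler": handler})`; that call is left out so the sketch runs without the SDK installed.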
Week 2: PubMed Classification
- Download PubMed MeSH dataset
- Multi-label classification head
- Train → S3
- Literature search API
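A multi-label head scores every label independently, so predictions come from applying a sigmoid to each logit and keeping the labels above a threshold. Here is a minimal sketch in plain Python; the 0.5 threshold and the label names are illustrative assumptions, and in practice the threshold is usually tuned on validation data.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multi_label(
    logits: Sequence[float], vocabulary: Sequence[str], threshold: float = 0.5
) -> List[str]:
    """Keep every label whose independent probability reaches the threshold."""
    return [name for name, logit in zip(vocabulary, logits) if sigmoid(logit) >= threshold]


# Hypothetical logits and label names, for illustration only.
predicted = decode_multi_label(
    [2.0, -1.0, 0.3], ["Diseases", "Organisms", "Chemicals and Drugs"]
)
```

Unlike a softmax head, this can return zero labels or several, which is the behaviour multi-label tagging needs.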
Week 3: Medical QA
- Download MedMCQA
- LLM fine-tuning with LoRA
- Deploy QA endpoint
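LoRA keeps the base weights frozen and trains two low-rank matrices per adapted layer, which is why it fits on a single rented GPU. For a weight matrix of shape d_out × d_in and rank r, the adapter adds r × (d_in + d_out) trainable parameters. The arithmetic is sketched below; the 4096 × 4096 shape and rank 8 are illustrative values, not settings taken from this project.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative dimensions only.
full = 4096 * 4096                        # parameters in the frozen weight matrix
added = lora_param_count(4096, 4096, 8)   # parameters trained by the adapter
fraction = added / full                   # share of the matrix that is trainable
```

In this example the adapter trains 65,536 parameters against 16,777,216 frozen ones, about 0.39% of the matrix.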
Week 4: Symptom Checker
- Symptom-disease dataset
- Train classifier
- Build symptom input → disease prediction API
📊 Dataset Comparison
| Dataset | Size | Task | Difficulty | Business Value |
|---|---|---|---|---|
| DDI (current) | 176K | Classification | Medium | ⭐⭐⭐⭐⭐ |
| Adverse Events | 10K | Binary | Easy | ⭐⭐⭐⭐⭐ |
| PubMed MeSH | 50K | Multi-label | Medium | ⭐⭐⭐⭐ |
| MedMCQA | 194K | MCQ | Hard | ⭐⭐⭐⭐ |
| PubMedQA | 274K | Yes/No/Maybe | Medium | ⭐⭐⭐⭐ |
| Symptom→Disease | Varies | Classification | Easy | ⭐⭐⭐⭐⭐ |
| Triage | ~5K | Classification | Easy | ⭐⭐⭐⭐⭐ |
🔗 Additional Resources
- MIMIC-III/IV: ICU clinical data (requires PhysioNet access)
- n2c2 Challenges: Clinical NLP shared tasks
- i2b2: De-identified clinical records
- ChemProt: Chemical-protein interactions
- BC5CDR: Chemical-disease relations
Generated: 2026-02-03