docs: add healthcare ML use cases and datasets roadmap

2026-02-10 06:45:13 +00:00 · 2026-02-03 16:17:33 +00:00
parent 2e479fc61b
commit f8a0e00a7f
1 changed files with 158 additions and 0 deletions
--- a/USE_CASES.md
+++ b/USE_CASES.md
@@ -0,0 +1,158 @@
 # Healthcare ML Use Cases & Datasets
 Curated list of healthcare/biomedical use cases similar to the current DDI task, each with a publicly available dataset for training on RunPod.
 ---
 ## 🔥 Priority 1: Ready to Train
 ### 1. Adverse Drug Event Classification
 **Dataset:** `Lots-of-LoRAs/task1495_adverse_drug_event_classification`
 - **Task:** Classify text for presence of adverse drug events
 - **Size:** ~10K samples
 - **Labels:** Binary (adverse event / no adverse event)
 - **Use Case:** Pharmacovigilance, FDA reporting automation
 - **Model:** Bio_ClinicalBERT
 ```python
 from datasets import load_dataset
 ds = load_dataset("Lots-of-LoRAs/task1495_adverse_drug_event_classification")
 ```
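In pharmacovigilance a missed adverse event usually costs more than a false alarm, so it may be worth tracking precision and recall on the positive class rather than accuracy alone. A minimal sketch, assuming the dataset's labels have already been mapped to `1` (adverse event) and `0` (no adverse event); `binary_prf` is a hypothetical helper and the labels below are toy values, not model output.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (adverse event) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only (assumed encoding: 1 = adverse event).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```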
 ### 2. PubMed Multi-Label Classification (MeSH)
 **Dataset:** `owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH`
 - **Task:** Assign MeSH medical subject headings to research articles
 - **Size:** ~50K articles
 - **Labels:** Multi-label (medical topics)
 - **Use Case:** Literature categorization, research discovery
 - **Model:** PubMedBERT
 ```python
 from datasets import load_dataset
 ds = load_dataset("owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH")
 ```
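Multi-label training generally needs each article's headings encoded as a multi-hot vector, and predictions decoded by thresholding per-label probabilities. The sketch below shows one way to do that in plain Python; the helper names and the toy heading lists are hypothetical, and the dataset's actual column layout should be checked before adapting it.

```python
def build_label_index(label_lists):
    """Map every distinct label to a stable integer index."""
    labels = sorted({label for row in label_lists for label in row})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(labels, label_index):
    """Encode one article's labels as a float multi-hot vector."""
    vec = [0.0] * len(label_index)
    for label in labels:
        vec[label_index[label]] = 1.0
    return vec


def decode(probs, label_index, threshold=0.5):
    """Return every label whose probability reaches the threshold."""
    id2label = {i: label for label, i in label_index.items()}
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]


# Toy label lists for illustration only.
articles = [["Anatomy", "Diseases"], ["Diseases"], ["Anatomy", "Organisms"]]
label_index = build_label_index(articles)
encoded = multi_hot(articles[0], label_index)
predicted = decode([0.9, 0.2, 0.7], label_index)
```

Float targets are used here because Hugging Face sequence classification models configured with `problem_type="multi_label_classification"` apply a binary cross-entropy loss, which expects float labels.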
 ### 3. Symptom-to-Disease Prediction
 **Dataset:** `shanover/disease_symptoms_prec_full`
 - **Task:** Predict disease from symptom descriptions
 - **Size:** Variable
 - **Labels:** Disease categories
 - **Use Case:** Triage, symptom checker apps
 - **Model:** Bio_ClinicalBERT
 ```python
 from datasets import load_dataset
 ds = load_dataset("shanover/disease_symptoms_prec_full")
 ```
 ### 4. Medical Triage Classification
 **Dataset:** `shubham212/Medical_Triage_Classification`
 - **Task:** Classify urgency level of medical cases
 - **Size:** ~5K samples (~500 downloads on the Hub)
 - **Labels:** Triage levels (Emergency, Urgent, Standard)
 - **Use Case:** ER automation, telemedicine routing
 - **Model:** Bio_ClinicalBERT
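A classifier for the three triage levels listed above needs a fixed label mapping at training and inference time. A minimal sketch, assuming the label order shown here (the dataset's own encoding may differ); `route` is a hypothetical helper that picks the highest-scoring level.

```python
# Assumed label order; verify against the dataset before training.
TRIAGE_LEVELS = ["Emergency", "Urgent", "Standard"]
label2id = {label: i for i, label in enumerate(TRIAGE_LEVELS)}
id2label = {i: label for label, i in label2id.items()}


def route(scores):
    """Return the triage level with the highest model score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return id2label[best]


# Toy scores for illustration only.
level = route([0.1, 0.7, 0.2])
```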
 ---
 ## 📚 Priority 2: QA & Reasoning
 ### 5. MedMCQA - Medical Exam Questions
 **Dataset:** `openlifescienceai/medmcqa` (24K downloads!)
 - **Task:** Answer medical entrance exam questions
 - **Size:** 194K MCQs covering 2.4K healthcare topics
 - **Labels:** Multiple choice (A/B/C/D)
 - **Use Case:** Medical education, knowledge testing
 - **Model:** Llama-3 or Gemma (LLM fine-tuning)
 ```python
 from datasets import load_dataset
 ds = load_dataset("openlifescienceai/medmcqa")
 ```
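For LLM fine-tuning, each question has to be rendered as a prompt with a single-letter target. The sketch below assumes the records expose `question`, `opa` to `opd`, and an integer `cop` (0 to 3) for the correct option; treat those field names as assumptions to verify, and `format_mcq` as a hypothetical helper.

```python
OPTION_KEYS = ["opa", "opb", "opc", "opd"]  # assumed column names
LETTERS = "ABCD"


def format_mcq(example):
    """Render one MCQ record as (prompt, target_letter)."""
    lines = [f"Question: {example['question']}"]
    for letter, key in zip(LETTERS, OPTION_KEYS):
        lines.append(f"{letter}. {example[key]}")
    lines.append("Answer:")
    return "\n".join(lines), LETTERS[example["cop"]]


# Hand-written toy record, not taken from the dataset.
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
    "cop": 2,
}
prompt, target = format_mcq(sample)
```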
 ### 6. PubMedQA - Research Question Answering
 **Dataset:** `qiaojin/PubMedQA` (18K downloads!)
 - **Task:** Answer yes/no/maybe questions from abstracts
 - **Size:** ~274K samples across three subsets (~1K expert-labeled, ~61K unlabeled, ~211K artificially generated)
 - **Labels:** yes / no / maybe
 - **Use Case:** Evidence-based medicine, literature review
 - **Model:** PubMedBERT or Bio_ClinicalBERT
 ```python
 from datasets import load_dataset
 # A config name is required: "pqa_labeled", "pqa_unlabeled", or "pqa_artificial"
 ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled")
 ```
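To train an encoder such as PubMedBERT on this task, each record can be turned into a (question, context, label id) triple. A minimal sketch, assuming records carry `question`, a `context` dict with a `contexts` list, and a `final_decision` string; these field names are assumptions to check against the loaded split, and the record below is a made-up placeholder.

```python
DECISIONS = ["yes", "no", "maybe"]
decision2id = {decision: i for i, decision in enumerate(DECISIONS)}


def to_classification_example(example):
    """Join the abstract sections and map the decision to an integer id."""
    context = " ".join(example["context"]["contexts"])
    label = decision2id[example["final_decision"]]
    return example["question"], context, label


# Placeholder record for illustration only; not real PubMedQA content.
record = {
    "question": "Does treatment X reduce symptom Y?",
    "context": {"contexts": ["Background sentence.", "Results sentence."]},
    "final_decision": "maybe",
}
question, context, label = to_classification_example(record)
```

The question and context would then typically be passed to the tokenizer as a sentence pair.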
 ---
 ## 🧬 Priority 3: Specialized NLP
 ### 7. Medical Abbreviation Disambiguation (MeDAL)
 **Dataset:** `McGill-NLP/medal`
 - **Task:** Disambiguate medical abbreviations in context
 - **Size:** 14GB → curated to 4GB
 - **Labels:** Abbreviation meanings
 - **Use Case:** Clinical note processing, EHR parsing
 - **Model:** Bio_ClinicalBERT
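One common way to frame abbreviation disambiguation for a BERT-style classifier is to mark the target abbreviation in the text so the model knows which token to resolve. The sketch below assumes whitespace-tokenized text and a token index for the abbreviation; the marker strings and `mark_abbreviation` are hypothetical, and the example sentence is hand-written.

```python
def mark_abbreviation(text, location, open_tag="[ABBR]", close_tag="[/ABBR]"):
    """Wrap the token at index `location` in marker tags."""
    tokens = text.split()
    if not 0 <= location < len(tokens):
        raise IndexError(f"location {location} outside text of {len(tokens)} tokens")
    tokens[location] = f"{open_tag} {tokens[location]} {close_tag}"
    return " ".join(tokens)


# Toy sentence for illustration only; "RA" is the token to disambiguate.
marked = mark_abbreviation("patient admitted with RA and joint pain", 3)
```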
 ### 8. BioInstruct - Instruction Following
 **Dataset:** `bio-nlp-umass/bioinstruct`
 - **Task:** Instruction-tuned biomedical tasks
 - **Size:** 25K instructions
 - **Labels:** Various biomedical tasks
 - **Use Case:** General biomedical assistant
 - **Model:** Llama-3 or Mistral (LoRA fine-tuning)
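Instruction tuning needs each record flattened into a prompt and a target completion. The sketch below uses an Alpaca-style template as one possible choice, and assumes records have `instruction`, optional `input`, and `output` fields; both the template and the field names are assumptions rather than something the dataset prescribes.

```python
def format_instruction(example):
    """Render one record as (prompt, completion), skipping an empty input."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts), example["output"]


# Hand-written toy record for illustration only.
record = {
    "instruction": "Expand the abbreviation.",
    "input": "BP",
    "output": "blood pressure",
}
prompt, completion = format_instruction(record)
```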
 ---
 ## 🛠️ Implementation Roadmap
 ### Week 1: Adverse Drug Events
 1. Download the ADE dataset
 2. Add it to handler.py as a new training mode
 3. Train the classifier and upload the model to S3
 4. Build an inference endpoint
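Adding a training mode to handler.py could be done with a small dispatch table keyed on a `mode` field in the job input. The sketch below is one possible shape, not the existing handler: the `mode` and `params` keys, the `register` decorator, and `train_ade` are all hypothetical. It relies only on the RunPod serverless convention that a handler receives a job dict with an `input` key.

```python
TRAINERS = {}


def register(mode):
    """Decorator that registers a training function under a mode name."""
    def decorator(fn):
        TRAINERS[mode] = fn
        return fn
    return decorator


@register("ade")
def train_ade(params):
    # Placeholder body: real code would load the dataset, fine-tune,
    # and upload the resulting model to S3.
    return {"mode": "ade", "epochs": params.get("epochs", 3), "status": "ok"}


def handler(job):
    """Dispatch a job to the trainer named in job["input"]["mode"]."""
    job_input = job.get("input", {})
    mode = job_input.get("mode")
    trainer = TRAINERS.get(mode)
    if trainer is None:
        return {"error": f"unknown mode: {mode!r}", "available": sorted(TRAINERS)}
    return trainer(job_input.get("params", {}))


# In a deployed worker this would be started with:
#   runpod.serverless.start({"handler": handler})
result = handler({"input": {"mode": "ade", "params": {"epochs": 1}}})
```

With this layout, each later week's task only needs a new function decorated with `@register(...)`.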
 ### Week 2: PubMed Classification
 1. Download the PubMed MeSH dataset
 2. Add a multi-label classification head
 3. Train and upload the model to S3
 4. Build a literature search API
 ### Week 3: Medical QA
 1. Download MedMCQA
 2. Fine-tune an LLM with LoRA
 3. Deploy a QA endpoint
 ### Week 4: Symptom Checker
 1. Download the symptom-disease dataset
 2. Train the classifier
 3. Build an API that maps symptom input to a disease prediction
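The prediction step of such an API can be sketched as a softmax over the classifier's logits followed by a top-k ranking. This is an illustrative sketch only: the response shape, `top_k_predictions`, the disease names, and the logits are all hypothetical stand-ins for real model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable diseases, highest probability first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"disease": id2label[i], "probability": probs[i]} for i in ranked]


# Toy labels and logits for illustration only.
id2label = {0: "migraine", 1: "influenza", 2: "common cold"}
predictions = top_k_predictions([0.2, 2.1, 1.3], id2label, k=2)
```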
 ---
 ## 📊 Dataset Comparison
 | Dataset | Size | Task | Difficulty | Business Value |
 |---------|------|------|------------|----------------|
 | DDI (current) | 176K | Classification | Medium | ⭐⭐⭐⭐⭐ |
 | Adverse Events | ~10K | Binary | Easy | ⭐⭐⭐⭐⭐ |
 | PubMed MeSH | ~50K | Multi-label | Medium | ⭐⭐⭐⭐ |
 | MedMCQA | 194K | MCQ | Hard | ⭐⭐⭐⭐ |
 | PubMedQA | 274K | Yes/No/Maybe | Medium | ⭐⭐⭐⭐ |
 | Symptom→Disease | Varies | Classification | Easy | ⭐⭐⭐⭐⭐ |
 | Triage | ~5K | Classification | Easy | ⭐⭐⭐⭐⭐ |
 ---
 ## 🔗 Additional Resources
 - **MIMIC-III/IV:** ICU clinical data (requires PhysioNet access)
 - **n2c2 Challenges:** Clinical NLP shared tasks
 - **i2b2:** De-identified clinical records
 - **ChemProt:** Chemical-protein interactions
 - **BC5CDR:** Chemical-disease relations
 ---
 *Generated: 2026-02-03*