docs: clean up README and USE_CASES formatting

2026-02-10 06:45:13 +00:00 · 2026-02-03 17:07:07 +00:00
parent 0bf3837e78
commit 210d9c8999
2 changed files with 107 additions and 174 deletions
--- a/USE_CASES.md
+++ b/USE_CASES.md
@@ -1,158 +1,92 @@
 # Healthcare ML Use Cases & Datasets

-Curated list of similar healthcare/biomedical use cases with publicly available datasets for training on RunPod.
+Curated list of healthcare/biomedical use cases with publicly available datasets.

 ---

-## 🔥 Priority 1: Ready to Train
+## Implemented

-### 1. Adverse Drug Event Classification
-**Dataset:** `Lots-of-LoRAs/task1495_adverse_drug_event_classification`
- **Task:** Classify text for presence of adverse drug events
- **Size:** ~10K samples
- **Labels:** Binary (adverse event / no adverse event)
- **Use Case:** Pharmacovigilance, FDA reporting automation
- **Model:** Bio_ClinicalBERT
+### 1. Drug-Drug Interaction (DDI) Classification
+- **Dataset:** DrugBank (bundled)
+- **Task:** Classify interaction severity
+- **Size:** 176K samples
+- **Labels:** Minor, Moderate, Major, Contraindicated
+- **Status:** Production ready

-```python
-from datasets import load_dataset
-ds = load_dataset("Lots-of-LoRAs/task1495_adverse_drug_event_classification")
-```
-
-### 2. PubMed Multi-Label Classification (MeSH)
-**Dataset:** `owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH`
- **Task:** Assign MeSH medical subject headings to research articles
- **Size:** ~50K articles
- **Labels:** Multi-label (medical topics)
- **Use Case:** Literature categorization, research discovery
- **Model:** PubMedBERT
-
-```python
-from datasets import load_dataset
-ds = load_dataset("owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH")
-```
+### 2. Adverse Drug Event Detection
+- **Dataset:** `ade-benchmark-corpus/ade_corpus_v2`
+- **Task:** Binary classification for ADE presence
+- **Size:** 30K samples
+- **Labels:** ADE / No ADE
+- **Status:** Production ready

 ### 3. Symptom-to-Disease Prediction
-**Dataset:** `shanover/disease_symptoms_prec_full`
- **Task:** Predict disease from symptom descriptions
- **Size:** Variable
- **Labels:** Disease categories
- **Use Case:** Triage, symptom checker apps
- **Model:** Bio_ClinicalBERT
-
-```python
-from datasets import load_dataset
-ds = load_dataset("shanover/disease_symptoms_prec_full")
-```
+- **Dataset:** `shanover/disease_symptoms_prec_full`
+- **Task:** Predict disease from symptoms
+- **Size:** ~5K samples
+- **Labels:** 41 disease categories
+- **Status:** Production ready

 ### 4. Medical Triage Classification
-**Dataset:** `shubham212/Medical_Triage_Classification`
- **Task:** Classify urgency level of medical cases
- **Size:** ~500 downloads (popular)
- **Labels:** Triage levels (Emergency, Urgent, Standard)
- **Use Case:** ER automation, telemedicine routing
- **Model:** Bio_ClinicalBERT
+- **Dataset:** `shubham212/Medical_Triage_Classification`
+- **Task:** Classify urgency level
+- **Labels:** Emergency, Urgent, Standard, Non-urgent
+- **Status:** Production ready (needs more training data)

 ---

-## 📚 Priority 2: QA & Reasoning
+## Future Candidates

-### 5. MedMCQA - Medical Exam Questions
-**Dataset:** `openlifescienceai/medmcqa` (24K downloads!)
+### PubMed Multi-Label Classification (MeSH)
+- **Dataset:** `owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH`
+- **Task:** Assign MeSH subject headings to articles
+- **Size:** 50K articles
+- **Use Case:** Literature categorization
+
+### MedMCQA - Medical Exam QA
+- **Dataset:** `openlifescienceai/medmcqa`
 - **Task:** Answer medical entrance exam questions
- **Size:** 194K MCQs covering 2.4K healthcare topics
- **Labels:** Multiple choice (A/B/C/D)
+- **Size:** 194K MCQs
 - **Use Case:** Medical education, knowledge testing
- **Model:** Llama-3 or Gemma (LLM fine-tuning)

-```python
-from datasets import load_dataset
-ds = load_dataset("openlifescienceai/medmcqa")
-```
-
-### 6. PubMedQA - Research Question Answering
-**Dataset:** `qiaojin/PubMedQA` (18K downloads!)
- **Task:** Answer yes/no/maybe questions from abstracts
+### PubMedQA - Research Question Answering
+- **Dataset:** `qiaojin/PubMedQA`
+- **Task:** Yes/No/Maybe from abstracts
 - **Size:** 274K samples
- **Labels:** yes / no / maybe
- **Use Case:** Evidence-based medicine, literature review
- **Model:** PubMedBERT or Bio_ClinicalBERT
+- **Use Case:** Evidence-based medicine

-```python
-from datasets import load_dataset
-ds = load_dataset("qiaojin/PubMedQA")
-```
+### Medical Abbreviation Disambiguation
+- **Dataset:** `McGill-NLP/medal`
+- **Task:** Disambiguate abbreviations in context
+- **Size:** 4GB curated
+- **Use Case:** Clinical note processing

---
-
-## 🧬 Priority 3: Specialized NLP
-
-### 7. Medical Abbreviation Disambiguation (MeDAL)
-**Dataset:** `McGill-NLP/medal`
- **Task:** Disambiguate medical abbreviations in context
- **Size:** 14GB → curated to 4GB
- **Labels:** Abbreviation meanings
- **Use Case:** Clinical note processing, EHR parsing
- **Model:** Bio_ClinicalBERT
-
-### 8. BioInstruct - Instruction Following
-**Dataset:** `bio-nlp-umass/bioinstruct`
+### BioInstruct
+- **Dataset:** `bio-nlp-umass/bioinstruct`
 - **Task:** Instruction-tuned biomedical tasks
 - **Size:** 25K instructions
- **Labels:** Various biomedical tasks
 - **Use Case:** General biomedical assistant
- **Model:** Llama-3 or Mistral (LoRA fine-tuning)

 ---

-## 🛠️ Implementation Roadmap
+## Dataset Comparison

-### Week 1: Adverse Drug Events
-1. Download ADE dataset
-2. Add to handler.py as new training mode
-3. Train classifier → S3
-4. Build inference endpoint
-
-### Week 2: PubMed Classification
-1. Download PubMed MeSH dataset
-2. Multi-label classification head
-3. Train → S3
-4. Literature search API
-
-### Week 3: Medical QA
-1. Download MedMCQA
-2. LLM fine-tuning with LoRA
-3. Deploy QA endpoint
-
-### Week 4: Symptom Checker
-1. Symptom-disease dataset
-2. Train classifier
-3. Build symptom input → disease prediction API
+| Dataset | Size | Task | Complexity |
+|---------|------|------|------------|
+| DDI (DrugBank) | 176K | 4-class | Medium |
+| ADE Corpus | 30K | Binary | Low |
+| PubMed MeSH | 50K | Multi-label | High |
+| MedMCQA | 194K | MCQ | High |
+| PubMedQA | 274K | 3-class | Medium |
+| Symptom-Disease | 5K | 41-class | Medium |
+| Triage | 5K | 4-class | Low |

 ---

-## 📊 Dataset Comparison
+## Additional Resources

-| Dataset | Size | Task | Difficulty | Business Value |
-|---------|------|------|------------|----------------|
-| DDI (current) | 176K | Classification | Medium | ⭐⭐⭐⭐⭐ |
-| Adverse Events | 10K | Binary | Easy | ⭐⭐⭐⭐⭐ |
-| PubMed MeSH | 50K | Multi-label | Medium | ⭐⭐⭐⭐ |
-| MedMCQA | 194K | MCQ | Hard | ⭐⭐⭐⭐ |
-| PubMedQA | 274K | Yes/No/Maybe | Medium | ⭐⭐⭐⭐ |
-| Symptom→Disease | Varies | Classification | Easy | ⭐⭐⭐⭐⭐ |
-| Triage | ~5K | Classification | Easy | ⭐⭐⭐⭐⭐ |
-
---
-
-## 🔗 Additional Resources
-
- **MIMIC-III/IV:** ICU clinical data (requires PhysioNet access)
+- **MIMIC-III/IV:** ICU clinical data (PhysioNet access required)
 - **n2c2 Challenges:** Clinical NLP shared tasks
 - **i2b2:** De-identified clinical records
 - **ChemProt:** Chemical-protein interactions
 - **BC5CDR:** Chemical-disease relations
-
---
-
-*Generated: 2026-02-03*