mirror of
https://github.com/ghndrx/kubeflow-pipelines.git
synced 2026-02-10 06:45:13 +00:00
# Healthcare ML Use Cases & Datasets

Curated list of healthcare/biomedical use cases with publicly available datasets.

---

## Implemented

### 1. Drug-Drug Interaction (DDI) Classification
- **Dataset:** DrugBank (bundled)
- **Task:** Classify interaction severity
- **Size:** 176K samples
- **Labels:** Minor, Moderate, Major, Contraindicated
- **Status:** Production ready

### 2. Adverse Drug Event Detection
- **Dataset:** `ade-benchmark-corpus/ade_corpus_v2`
- **Task:** Binary classification for ADE presence
- **Size:** 30K samples
- **Labels:** ADE / No ADE
- **Status:** Production ready

### 3. Symptom-to-Disease Prediction
- **Dataset:** `shanover/disease_symptoms_prec_full`
- **Task:** Predict disease from symptoms
- **Size:** ~5K samples
- **Labels:** 41 disease categories
- **Status:** Production ready

### 4. Medical Triage Classification
- **Dataset:** `shubham212/Medical_Triage_Classification`
- **Task:** Classify urgency level
- **Labels:** Emergency, Urgent, Standard, Non-urgent
- **Status:** Production ready (needs more training data)
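
The label sets listed under Implemented map naturally onto integer class ids for training. The following is a minimal sketch, assuming Python: the `TASK_LABELS` dictionary, its task keys, and the `build_label_maps` helper are hypothetical names for illustration and are not taken from the repository. The label order (ascending severity or urgency) is also an assumption and may differ from how each dataset encodes its classes. The symptom-to-disease task is left out because its 41 category names are not listed here.

```python
# Hypothetical task keys; label names are copied from the list above.
# The ordering (least to most severe) is an assumption, not the datasets' own encoding.
TASK_LABELS = {
    "ddi": ["Minor", "Moderate", "Major", "Contraindicated"],
    "ade": ["No ADE", "ADE"],
    "triage": ["Non-urgent", "Standard", "Urgent", "Emergency"],
}


def build_label_maps(labels):
    """Return (label2id, id2label) dictionaries for an ordered list of labels."""
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


if __name__ == "__main__":
    # Print the id assignment for each task.
    for task, labels in TASK_LABELS.items():
        label2id, _ = build_label_maps(labels)
        print(task, label2id)
```
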
---

## Future Candidates

### PubMed Multi-Label Classification (MeSH)
- **Dataset:** `owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH`
- **Task:** Assign MeSH subject headings to articles
- **Size:** 50K articles
- **Use Case:** Literature categorization

### MedMCQA - Medical Exam QA
- **Dataset:** `openlifescienceai/medmcqa`
- **Task:** Answer medical entrance exam questions
- **Size:** 194K MCQs
- **Use Case:** Medical education, knowledge testing

### PubMedQA - Research Question Answering
- **Dataset:** `qiaojin/PubMedQA`
- **Task:** Yes/No/Maybe from abstracts
- **Size:** 274K samples
- **Use Case:** Evidence-based medicine

### Medical Abbreviation Disambiguation
- **Dataset:** `McGill-NLP/medal`
- **Task:** Disambiguate abbreviations in context
- **Size:** 4GB curated
- **Use Case:** Clinical note processing

### BioInstruct
- **Dataset:** `bio-nlp-umass/bioinstruct`
- **Task:** Instruction-tuned biomedical tasks
- **Size:** 25K instructions
- **Use Case:** General biomedical assistant

---

## Dataset Comparison

| Dataset | Size | Task | Complexity |
|---------|------|------|------------|
| DDI (DrugBank) | 176K | 4-class | Medium |
| ADE Corpus | 30K | Binary | Low |
| PubMed MeSH | 50K | Multi-label | High |
| MedMCQA | 194K | MCQ | High |
| PubMedQA | 274K | 3-class | Medium |
| Symptom-Disease | 5K | 41-class | Medium |
| Triage | 5K | 4-class | Low |
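
When choosing which candidate to pick up next, the comparison table can be held as plain data and filtered. The following is a minimal sketch, assuming Python: `DATASETS`, `parse_size`, and `filter_by_complexity` are hypothetical names introduced for illustration, and the rows are copied from the comparison table rather than read from the datasets themselves.

```python
# Rows copied from the comparison table: (name, size, task, complexity).
DATASETS = [
    ("DDI (DrugBank)", "176K", "4-class", "Medium"),
    ("ADE Corpus", "30K", "Binary", "Low"),
    ("PubMed MeSH", "50K", "Multi-label", "High"),
    ("MedMCQA", "194K", "MCQ", "High"),
    ("PubMedQA", "274K", "3-class", "Medium"),
    ("Symptom-Disease", "5K", "41-class", "Medium"),
    ("Triage", "5K", "4-class", "Low"),
]


def parse_size(size):
    """Convert a size string such as '176K' or '~5K' into an integer sample count."""
    text = size.strip().lstrip("~").upper()
    if text.endswith("K"):
        return int(float(text[:-1]) * 1_000)
    return int(text)


def filter_by_complexity(rows, complexity):
    """Return rows with the given complexity, largest dataset first."""
    matches = [row for row in rows if row[3] == complexity]
    return sorted(matches, key=lambda row: parse_size(row[1]), reverse=True)


if __name__ == "__main__":
    # List the medium-complexity datasets by descending size.
    for name, size, task, _ in filter_by_complexity(DATASETS, "Medium"):
        print(f"{name}: {parse_size(size)} samples, {task}")
```
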
---

## Additional Resources

- **MIMIC-III/IV:** ICU clinical data (PhysioNet access required)
- **n2c2 Challenges:** Clinical NLP shared tasks
- **i2b2:** De-identified clinical records
- **ChemProt:** Chemical-protein interactions
- **BC5CDR:** Chemical-disease relations