Healthcare ML Use Cases & Datasets
Curated list of healthcare/biomedical use cases with publicly available datasets.
Implemented
1. Drug-Drug Interaction (DDI) Classification
- Dataset: DrugBank (bundled)
- Task: Classify interaction severity
- Size: 176K samples
- Labels: Minor, Moderate, Major, Contraindicated
- Status: Production ready
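As a minimal sketch of how the four severity labels above might be encoded for a classifier, the mapping below assigns integer class ids in increasing order of severity. The ordering and the names `SEVERITY_LABELS`, `encode`, and `decode` are illustrative assumptions, not taken from the repository's code.

```python
# Hypothetical label encoding for the DDI severity task.
# The least-to-most-severe ordering is an assumption for illustration.
SEVERITY_LABELS = ["Minor", "Moderate", "Major", "Contraindicated"]

LABEL_TO_ID = {label: idx for idx, label in enumerate(SEVERITY_LABELS)}
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}


def encode(label: str) -> int:
    """Map a severity label to its integer class id (case-insensitive)."""
    normalized = label.strip().capitalize()
    if normalized not in LABEL_TO_ID:
        raise ValueError(f"Unknown severity label: {label!r}")
    return LABEL_TO_ID[normalized]


def decode(class_id: int) -> str:
    """Map an integer class id back to its severity label."""
    return ID_TO_LABEL[class_id]
```

Keeping the ids ordered by severity makes it possible to compare predictions ordinally (for example, flagging a predicted "Minor" for a true "Contraindicated" as a worse error than an adjacent-class miss), though whether the pipeline does this is not stated here.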
2. Adverse Drug Event Detection
- Dataset: ade-benchmark-corpus/ade_corpus_v2
- Task: Binary classification for ADE presence
- Size: 30K samples
- Labels: ADE / No ADE
- Status: Production ready
3. Symptom-to-Disease Prediction
- Dataset: shanover/disease_symptoms_prec_full
- Task: Predict disease from symptoms
- Size: ~5K samples
- Labels: 41 disease categories
- Status: Production ready
4. Medical Triage Classification
- Dataset: shubham212/Medical_Triage_Classification
- Task: Classify urgency level
- Labels: Emergency, Urgent, Standard, Non-urgent
- Status: Production ready (needs more training data)
Future Candidates
PubMed Multi-Label Classification (MeSH)
- Dataset: owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH
- Task: Assign MeSH subject headings to articles
- Size: 50K articles
- Use Case: Literature categorization
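Multi-label classification differs from the single-label tasks above in that each article can carry several headings at once, so targets are usually multi-hot vectors rather than a single class id. The sketch below shows that encoding; the vocabulary entries are placeholders, not real MeSH headings, and the function name `multi_hot` is hypothetical.

```python
from typing import Iterable, List, Sequence


def multi_hot(labels: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with a 1 at each position whose label is present."""
    present = set(labels)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"Labels not in vocabulary: {sorted(unknown)}")
    return [1 if term in present else 0 for term in vocabulary]


# Placeholder vocabulary; real headings would come from the dataset itself.
VOCAB = ["heading_a", "heading_b", "heading_c", "heading_d"]

# An article tagged with two of the four placeholder headings.
vector = multi_hot(["heading_b", "heading_d"], VOCAB)
print(vector)  # [0, 1, 0, 1]
```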
MedMCQA - Medical Exam QA
- Dataset: openlifescienceai/medmcqa
- Task: Answer medical entrance exam questions
- Size: 194K MCQs
- Use Case: Medical education, knowledge testing
PubMedQA - Research Question Answering
- Dataset: qiaojin/PubMedQA
- Task: Yes/No/Maybe from abstracts
- Size: 274K samples
- Use Case: Evidence-based medicine
Medical Abbreviation Disambiguation
- Dataset: McGill-NLP/medal
- Task: Disambiguate abbreviations in context
- Size: 4 GB (curated)
- Use Case: Clinical note processing
BioInstruct
- Dataset: bio-nlp-umass/bioinstruct
- Task: Instruction-tuned biomedical tasks
- Size: 25K instructions
- Use Case: General biomedical assistant
Dataset Comparison
| Dataset | Size | Task | Complexity |
|---|---|---|---|
| DDI (DrugBank) | 176K | 4-class | Medium |
| ADE Corpus | 30K | Binary | Low |
| PubMed MeSH | 50K | Multi-label | High |
| MedMCQA | 194K | MCQ | High |
| PubMedQA | 274K | 3-class | Medium |
| Symptom-Disease | 5K | 41-class | Medium |
| Triage | 5K | 4-class | Low |
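The comparison table can also be held as plain records, which makes it easy to filter candidates by complexity or rank them by size. The sketch below transcribes the table rows as written; the names `DatasetInfo`, `parse_size`, and `by_complexity` are hypothetical helpers, and sizes are treated as approximate counts.

```python
from typing import List, NamedTuple


class DatasetInfo(NamedTuple):
    name: str
    size: str        # as written in the table, e.g. "176K"
    task: str
    complexity: str


# Rows transcribed from the comparison table above.
DATASETS = [
    DatasetInfo("DDI (DrugBank)", "176K", "4-class", "Medium"),
    DatasetInfo("ADE Corpus", "30K", "Binary", "Low"),
    DatasetInfo("PubMed MeSH", "50K", "Multi-label", "High"),
    DatasetInfo("MedMCQA", "194K", "MCQ", "High"),
    DatasetInfo("PubMedQA", "274K", "3-class", "Medium"),
    DatasetInfo("Symptom-Disease", "5K", "41-class", "Medium"),
    DatasetInfo("Triage", "5K", "4-class", "Low"),
]


def parse_size(size: str) -> int:
    """Convert a table size such as '176K' or '~5K' to an approximate count."""
    text = size.strip().lstrip("~").upper()
    if text.endswith("K"):
        return int(float(text[:-1]) * 1_000)
    return int(text)


def by_complexity(level: str) -> List[DatasetInfo]:
    """Datasets at the given complexity level, largest first."""
    matches = [d for d in DATASETS if d.complexity == level]
    return sorted(matches, key=lambda d: parse_size(d.size), reverse=True)


for d in by_complexity("Medium"):
    print(d.name, parse_size(d.size))
```

Calling `by_complexity("Low")` in the same way would list the ADE Corpus and Triage datasets, which may be reasonable starting points for a quick pipeline smoke test.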
Additional Resources
- MIMIC-III/IV: ICU clinical data (PhysioNet access required)
- n2c2 Challenges: Clinical NLP shared tasks
- i2b2: De-identified clinical records
- ChemProt: Chemical-protein interactions
- BC5CDR: Chemical-disease relations