ghndrx/kubeflow-pipelines

mirror of https://github.com/ghndrx/kubeflow-pipelines.git synced 2026-02-10 06:45:13 +00:00

Greg Hendrickson 210d9c8999 docs: clean up README and USE_CASES formatting

2026-02-03 17:07:07 +00:00

Healthcare ML Use Cases & Datasets

A curated list of healthcare and biomedical ML use cases, each paired with a publicly available dataset.

Implemented

1. Drug-Drug Interaction (DDI) Classification

Dataset: DrugBank (bundled)
Task: Classify interaction severity
Size: 176K samples
Labels: Minor, Moderate, Major, Contraindicated
Status: Production ready
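
The four severity labels have to be mapped to integer class IDs before a classifier can be trained on them. A minimal sketch of that mapping follows. It assumes the classes are ordered from least to most severe; that ordering and the names `SEVERITY_LABELS`, `LABEL2ID`, `ID2LABEL`, and `encode_severity` are illustrative and not taken from the pipeline code.

```python
# Assumption: class IDs follow increasing severity. The pipeline's actual
# ordering is not stated in this document.
SEVERITY_LABELS = ["Minor", "Moderate", "Major", "Contraindicated"]

LABEL2ID = {label: i for i, label in enumerate(SEVERITY_LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def encode_severity(label: str) -> int:
    """Return the class ID for a severity label.

    Matching ignores surrounding whitespace and letter case.
    """
    normalized = label.strip().capitalize()
    if normalized not in LABEL2ID:
        raise ValueError(f"Unknown severity label: {label!r}")
    return LABEL2ID[normalized]
```

Keeping both directions of the mapping in one place makes it easier to decode model predictions back into the label names listed above.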

2. Adverse Drug Event Detection

Dataset: ade-benchmark-corpus/ade_corpus_v2
Task: Binary classification for ADE presence
Size: 30K samples
Labels: ADE / No ADE
Status: Production ready

3. Symptom-to-Disease Prediction

Dataset: shanover/disease_symptoms_prec_full
Task: Predict disease from symptoms
Size: ~5K samples
Labels: 41 disease categories
Status: Production ready

4. Medical Triage Classification

Dataset: shubham212/Medical_Triage_Classification
Task: Classify urgency level
Labels: Emergency, Urgent, Standard, Non-urgent
Status: Production ready (needs more training data)
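
The dataset identifiers for use cases 2 to 4 look like Hugging Face Hub IDs, so they could be loaded with the `datasets` library. The sketch below rests on that assumption. The registry keys, `DatasetSpec`, and `load_use_case` are hypothetical names, and the ADE config name should be checked against the dataset card before use. DDI is omitted because its data is bundled with the repository.

```python
from typing import NamedTuple, Optional


class DatasetSpec(NamedTuple):
    hub_id: str                   # dataset ID as listed in this document
    config: Optional[str] = None  # optional config (subset) name


# Assumption: these IDs resolve on the Hugging Face Hub.
IMPLEMENTED = {
    # Assumption: the classification subset is the one used for ADE detection.
    "ade": DatasetSpec("ade-benchmark-corpus/ade_corpus_v2",
                       "Ade_corpus_v2_classification"),
    "symptom_disease": DatasetSpec("shanover/disease_symptoms_prec_full"),
    "triage": DatasetSpec("shubham212/Medical_Triage_Classification"),
}


def load_use_case(name: str, split: str = "train"):
    """Load the dataset registered under `name` (requires network access)."""
    if name not in IMPLEMENTED:
        raise KeyError(f"Unknown use case: {name!r}")
    # Imported lazily so the registry can be inspected without `datasets`.
    from datasets import load_dataset

    spec = IMPLEMENTED[name]
    return load_dataset(spec.hub_id, name=spec.config, split=split)
```

Calling `load_use_case("triage")` would download the triage dataset on first use, assuming a `train` split exists for it.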

Future Candidates

PubMed Multi-Label Classification (MeSH)

Dataset: owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH
Task: Assign MeSH subject headings to articles
Size: 50K articles
Use Case: Literature categorization
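
Unlike the single-label tasks above, a multi-label task lets one article carry several headings at once, so targets are usually encoded as multi-hot vectors. A minimal sketch of that encoding follows. The four-entry vocabulary is an illustrative subset of MeSH top-level categories, not the dataset's actual label columns, and `multi_hot` is a hypothetical helper.

```python
from typing import Iterable, List, Sequence


def multi_hot(assigned: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a set of assigned labels as a 0/1 vector over `vocabulary`."""
    index = {label: i for i, label in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for label in assigned:
        if label not in index:
            raise ValueError(f"Label not in vocabulary: {label!r}")
        vector[index[label]] = 1
    return vector


# Illustrative vocabulary (assumption): a few MeSH top-level categories.
VOCAB = ["Anatomy", "Organisms", "Diseases", "Chemicals and Drugs"]

example = multi_hot(["Diseases", "Chemicals and Drugs"], VOCAB)
print(example)  # [0, 0, 1, 1]
```

Each position is an independent yes/no decision, which is why multi-label models are typically trained with a per-label sigmoid rather than a softmax over classes.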

MedMCQA - Medical Exam QA

Dataset: openlifescienceai/medmcqa
Task: Answer medical entrance exam questions
Size: 194K MCQs
Use Case: Medical education, knowledge testing

PubMedQA - Research Question Answering

Dataset: qiaojin/PubMedQA
Task: Yes/No/Maybe from abstracts
Size: 274K samples
Use Case: Evidence-based medicine

Medical Abbreviation Disambiguation

Dataset: McGill-NLP/medal
Task: Disambiguate abbreviations in context
Size: ~4 GB (curated)
Use Case: Clinical note processing

BioInstruct

Dataset: bio-nlp-umass/bioinstruct
Task: Instruction-tuned biomedical tasks
Size: 25K instructions
Use Case: General biomedical assistant

Dataset Comparison

Dataset	Size	Task	Complexity
DDI (DrugBank)	176K	4-class	Medium
ADE Corpus	30K	Binary	Low
PubMed MeSH	50K	Multi-label	High
MedMCQA	194K	MCQ	High
PubMedQA	274K	3-class	Medium
Symptom-Disease	~5K	41-class	Medium
Triage	5K	4-class	Low

Additional Resources

MIMIC-III/IV: ICU clinical data (PhysioNet access required)
n2c2 Challenges: Clinical NLP shared tasks
i2b2: De-identified clinical records
ChemProt: Chemical-protein interactions
BC5CDR: Chemical-disease relations