# Healthcare ML Training Pipeline

Serverless GPU training infrastructure for healthcare NLP models. Training runs on RunPod serverless GPUs, with trained models stored in S3.

## Overview

This project provides production-ready ML pipelines for training healthcare classification models:

- Drug-Drug Interaction (DDI) - Severity classification from DrugBank (176K samples)
- Adverse Drug Events (ADE) - Binary detection from ADE Corpus V2 (30K samples)
- Medical Triage - Urgency level classification
- Symptom-to-Disease - Diagnosis prediction (41 disease classes)

All models use Bio_ClinicalBERT as the base and are fine-tuned on domain-specific datasets.
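As an illustrative sketch (not the project's actual `handler.py`), a per-task registry plus a model loader might look like the following. Only the ADE (binary, 2 labels) and symptom-to-disease (41 classes) label counts come from this README; the DDI and triage counts are placeholders:

```python
# Hypothetical task registry. ADE (2) and symptom_disease (41) label counts
# are stated in this README; the ddi and triage values are placeholders.
TASKS = {
    "ddi": {"num_labels": 4},             # assumption: depends on dataset version
    "ade": {"num_labels": 2},             # binary detection
    "triage": {"num_labels": 3},          # assumption: urgency levels
    "symptom_disease": {"num_labels": 41},
}

def load_model(task: str, base_model: str = "emilyalsentzer/Bio_ClinicalBERT"):
    """Load the base encoder with a fresh classification head for `task`."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=TASKS[task]["num_labels"]
    )
    return tokenizer, model
```

The shared loader is what lets one handler serve all four tasks: only the label count changes between them.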

## Training Results

| Task | Dataset | Samples | Accuracy | F1 Score |
|------|---------|---------|----------|----------|
| DDI Classification | DrugBank | 176K | 100% | 100% |
| ADE Detection | ADE Corpus V2 | 9K | 93.5% | 95.3% |
| Symptom-Disease | Disease Symptoms | 4.4K | 100% | 100% |

## Quick Start

### Run Training

```bash
curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT/run" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "task": "ddi",
      "model_name": "emilyalsentzer/Bio_ClinicalBERT",
      "max_samples": 10000,
      "epochs": 3,
      "batch_size": 16,
      "s3_bucket": "your-bucket",
      "aws_access_key_id": "...",
      "aws_secret_access_key": "...",
      "aws_session_token": "..."
    }
  }'
```

Available tasks: `ddi`, `ade`, `triage`, `symptom_disease`
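The same request can be scripted. A minimal stdlib-only sketch (the payload keys match the curl example above; the helper names `build_payload` and `submit_run` are ours, not part of the project):

```python
import json
import os
import urllib.request

def build_payload(task: str, **overrides) -> dict:
    """Assemble the 'input' payload from the curl example; overrides win."""
    params = {
        "task": task,
        "model_name": "emilyalsentzer/Bio_ClinicalBERT",
        "max_samples": 10000,
        "epochs": 3,
        "batch_size": 16,
    }
    params.update(overrides)
    return {"input": params}

def submit_run(payload: dict) -> dict:
    """POST the job to the RunPod serverless /run endpoint."""
    endpoint = os.environ["RUNPOD_ENDPOINT"]
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint}/run",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `submit_run(build_payload("ade", epochs=5, s3_bucket="your-bucket"))` queues an ADE run with everything else left at the defaults above.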

### Download Trained Model

```bash
aws s3 cp s3://your-bucket/model.tar.gz .
tar -xzf model.tar.gz
```

## Project Structure

```
├── components/
│   └── runpod_trainer/
│       ├── Dockerfile
│       ├── handler.py          # Multi-task training logic
│       ├── requirements.txt
│       └── data/               # DrugBank DDI dataset
├── pipelines/
│   ├── healthcare_training.py  # Kubeflow pipeline definitions
│   ├── ddi_training_runpod.py
│   └── ddi_data_prep.py
├── .github/workflows/
│   └── build-trainer.yaml      # CI/CD
└── manifests/
    └── argocd-app.yaml
```

## Configuration

All configuration is via environment variables. Copy .env.example to .env and fill in your values:

```bash
cp .env.example .env
# Edit .env with your credentials
```

### Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `RUNPOD_API_KEY` | Yes | - | RunPod API key |
| `RUNPOD_ENDPOINT` | Yes | - | RunPod serverless endpoint ID |
| `AWS_ACCESS_KEY_ID` | Yes | - | AWS credentials for S3 |
| `AWS_SECRET_ACCESS_KEY` | Yes | - | AWS credentials for S3 |
| `AWS_SESSION_TOKEN` | No | - | For assumed-role sessions |
| `AWS_REGION` | No | `us-east-1` | AWS region |
| `S3_BUCKET` | Yes | - | Bucket for model artifacts |
| `BASE_MODEL` | No | `Bio_ClinicalBERT` | HuggingFace model ID |
| `MAX_SAMPLES` | No | `10000` | Training samples |
| `EPOCHS` | No | `3` | Training epochs |
| `BATCH_SIZE` | No | `16` | Batch size |
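A minimal sketch of how these variables could be resolved into a typed config object (field names mirror the table above; `PipelineConfig` and the `_required`/`_optional` helpers are illustrative, not the project's actual `config.py`):

```python
import os
from dataclasses import dataclass, field

def _required(name: str):
    # Raises KeyError at construction time if the variable is unset.
    return field(default_factory=lambda: os.environ[name])

def _optional(name: str, default: str):
    return field(default_factory=lambda: os.environ.get(name, default))

@dataclass
class PipelineConfig:
    """Pipeline settings resolved from the environment (sketch)."""
    runpod_api_key: str = _required("RUNPOD_API_KEY")
    runpod_endpoint: str = _required("RUNPOD_ENDPOINT")
    aws_access_key_id: str = _required("AWS_ACCESS_KEY_ID")
    aws_secret_access_key: str = _required("AWS_SECRET_ACCESS_KEY")
    s3_bucket: str = _required("S3_BUCKET")
    aws_region: str = _optional("AWS_REGION", "us-east-1")
    base_model: str = _optional("BASE_MODEL", "emilyalsentzer/Bio_ClinicalBERT")
    max_samples: int = field(default_factory=lambda: int(os.environ.get("MAX_SAMPLES", "10000")))
    epochs: int = field(default_factory=lambda: int(os.environ.get("EPOCHS", "3")))
    batch_size: int = field(default_factory=lambda: int(os.environ.get("BATCH_SIZE", "16")))
```

Using `default_factory` means the environment is read when `PipelineConfig()` is constructed, not at import time, so tests can set variables first.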

For production deployments, store credentials in a Kubernetes Secret rather than a plain `.env` file:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ml-pipeline-secrets
type: Opaque
stringData:
  RUNPOD_API_KEY: "your-key"
  AWS_ACCESS_KEY_ID: "your-key"
  AWS_SECRET_ACCESS_KEY: "your-secret"
```
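Pods can then consume every key of that Secret as environment variables via `envFrom`. A pod-spec fragment (the container name `trainer` is illustrative):

```yaml
# Fragment of a pod spec: inject all Secret keys as env vars
containers:
  - name: trainer
    envFrom:
      - secretRef:
          name: ml-pipeline-secrets
```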

## Supported Models

| Model | Type | Use Case |
|-------|------|----------|
| `emilyalsentzer/Bio_ClinicalBERT` | BERT | Classification tasks |
| `meta-llama/Llama-3.1-8B-Instruct` | LLM | Text generation (LoRA) |
| `google/gemma-3-4b-it` | LLM | Lightweight inference |

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `task` | `ddi` | Training task |
| `model_name` | `Bio_ClinicalBERT` | HuggingFace model ID |
| `max_samples` | `10000` | Training samples |
| `epochs` | `3` | Training epochs |
| `batch_size` | `16` | Batch size |
| `eval_split` | `0.1` | Validation split |
| `s3_bucket` | - | S3 bucket for output |

## Development

```bash
# Build container
cd components/runpod_trainer
docker build -t healthcare-trainer .

# Trigger CI build
gh workflow run build-trainer.yaml
```

## License

MIT