mirror of
https://github.com/ghndrx/kubeflow-pipelines.git
synced 2026-02-10 06:45:13 +00:00
# Healthcare ML Training Pipeline

Serverless GPU training infrastructure for healthcare NLP models. Training runs on RunPod serverless GPUs, with trained models stored in S3.

## Overview

This project provides production-ready ML pipelines for training healthcare classification models:

- **Drug-Drug Interaction (DDI)** - Severity classification from DrugBank (176K samples)
- **Adverse Drug Events (ADE)** - Binary detection from ADE Corpus V2 (30K samples)
- **Medical Triage** - Urgency level classification
- **Symptom-to-Disease** - Diagnosis prediction (41 disease classes)

All models use Bio_ClinicalBERT as the base and are fine-tuned on domain-specific datasets.

## Training Results

| Task | Dataset | Samples | Accuracy | F1 Score |
|------|---------|---------|----------|----------|
| DDI Classification | DrugBank | 176K | 100% | 100% |
| ADE Detection | ADE Corpus V2 | 9K | 93.5% | 95.3% |
| Symptom-Disease | Disease Symptoms | 4.4K | 100% | 100% |

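For readers comparing the Accuracy and F1 columns, the sketch below shows how the two metrics are commonly computed for a binary task such as ADE detection. It is a dependency-free illustration only: the function name `binary_metrics` is hypothetical, and the project's actual evaluation code may differ (for example, by using a metrics library or an averaged F1 for multi-class tasks).

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, f1) for binary labels, where 1 is the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1
```

Because F1 ignores true negatives, it can be higher than accuracy when the positive class dominates the correct predictions.
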
## Quick Start

### Run Training

```bash
curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT/run" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "task": "ddi",
      "model_name": "emilyalsentzer/Bio_ClinicalBERT",
      "max_samples": 10000,
      "epochs": 3,
      "batch_size": 16,
      "s3_bucket": "your-bucket",
      "aws_access_key_id": "...",
      "aws_secret_access_key": "...",
      "aws_session_token": "..."
    }
  }'
```

Available tasks: `ddi`, `ade`, `triage`, `symptom_disease`

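The same request can be issued from Python. The sketch below mirrors the `curl` call using only the standard library; the function names `build_run_request` and `submit` are hypothetical, and the shape of the response is not assumed beyond it being JSON.

```python
import json
import urllib.request

# Task names taken from the list of available tasks in this README.
VALID_TASKS = {"ddi", "ade", "triage", "symptom_disease"}


def build_run_request(endpoint_id, api_key, job_input):
    """Build (but do not send) the POST request for the endpoint's /run URL."""
    task = job_input.get("task")
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.runpod.ai/v2/{endpoint_id}/run",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit(request, timeout=30):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Example usage (performs a real network call, so it is left commented out):
# import os
# req = build_run_request(
#     os.environ["RUNPOD_ENDPOINT"],
#     os.environ["RUNPOD_API_KEY"],
#     {"task": "ddi", "epochs": 3, "s3_bucket": "your-bucket"},
# )
# print(submit(req))
```

Separating request construction from sending keeps the payload easy to inspect and test without touching the network.
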
### Download Trained Model

```bash
aws s3 cp s3://your-bucket/model.tar.gz .
tar -xzf model.tar.gz
```

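If the extraction step needs to run inside a script rather than a shell, a standard-library sketch could look like the following. The helper name `extract_model` is hypothetical, and the layout of files inside `model.tar.gz` is not assumed. The check only rejects member names that would resolve outside the destination directory; it does not inspect symlink targets.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse archives whose member names would land outside dest.
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    # Report what was extracted, relative to dest, in a stable order.
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```
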
## Project Structure

```
├── components/
│   └── runpod_trainer/
│       ├── Dockerfile
│       ├── handler.py              # Multi-task training logic
│       ├── requirements.txt
│       └── data/                   # DrugBank DDI dataset
├── pipelines/
│   ├── healthcare_training.py      # Kubeflow pipeline definitions
│   ├── ddi_training_runpod.py
│   └── ddi_data_prep.py
├── .github/workflows/
│   └── build-trainer.yaml          # CI/CD
└── manifests/
    └── argocd-app.yaml
```

## Configuration

All configuration is via environment variables. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your credentials
```

### Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `RUNPOD_API_KEY` | Yes | - | RunPod API key |
| `RUNPOD_ENDPOINT` | Yes | - | RunPod serverless endpoint ID |
| `AWS_ACCESS_KEY_ID` | Yes | - | AWS credentials for S3 |
| `AWS_SECRET_ACCESS_KEY` | Yes | - | AWS credentials for S3 |
| `AWS_SESSION_TOKEN` | No | - | For assumed role sessions |
| `AWS_REGION` | No | us-east-1 | AWS region |
| `S3_BUCKET` | Yes | - | Bucket for model artifacts |
| `BASE_MODEL` | No | Bio_ClinicalBERT | HuggingFace model ID |
| `MAX_SAMPLES` | No | 10000 | Training samples |
| `EPOCHS` | No | 3 | Training epochs |
| `BATCH_SIZE` | No | 16 | Batch size |

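One way to consume these variables is a small dataclass-based loader, sketched below. The class name `PipelineConfig` and the function `load_config` are hypothetical; the variable names and defaults are copied from the table above, and the project's own configuration module may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variables marked "Yes" in the Required column of the table.
REQUIRED = (
    "RUNPOD_API_KEY",
    "RUNPOD_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET",
)


@dataclass(frozen=True)
class PipelineConfig:
    runpod_api_key: str
    runpod_endpoint: str
    aws_access_key_id: str
    aws_secret_access_key: str
    s3_bucket: str
    aws_session_token: Optional[str] = None
    aws_region: str = "us-east-1"
    base_model: str = "Bio_ClinicalBERT"
    max_samples: int = 10000
    epochs: int = 3
    batch_size: int = 16


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    """Build a config from env, failing fast when required values are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return PipelineConfig(
        runpod_api_key=env["RUNPOD_API_KEY"],
        runpod_endpoint=env["RUNPOD_ENDPOINT"],
        aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
        s3_bucket=env["S3_BUCKET"],
        aws_session_token=env.get("AWS_SESSION_TOKEN") or None,
        aws_region=env.get("AWS_REGION", "us-east-1"),
        base_model=env.get("BASE_MODEL", "Bio_ClinicalBERT"),
        max_samples=int(env.get("MAX_SAMPLES", "10000")),
        epochs=int(env.get("EPOCHS", "3")),
        batch_size=int(env.get("BATCH_SIZE", "16")),
    )
```

Passing the environment as a mapping makes the loader testable with a plain dictionary, and reporting every missing variable at once avoids fixing them one failed run at a time.
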
### Kubernetes Secrets (Recommended)

For production, keep credentials in a Kubernetes Secret rather than in a `.env` file or hardcoded environment variables:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ml-pipeline-secrets
type: Opaque
stringData:
  RUNPOD_API_KEY: "your-key"
  AWS_ACCESS_KEY_ID: "your-key"
  AWS_SECRET_ACCESS_KEY: "your-secret"
```

### Supported Models

| Model | Type | Use Case |
|-------|------|----------|
| `emilyalsentzer/Bio_ClinicalBERT` | BERT | Classification tasks |
| `meta-llama/Llama-3.1-8B-Instruct` | LLM | Text generation (LoRA) |
| `google/gemma-3-4b-it` | LLM | Lightweight inference |

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `task` | ddi | Training task |
| `model_name` | Bio_ClinicalBERT | HuggingFace model ID |
| `max_samples` | 10000 | Training samples |
| `epochs` | 3 | Training epochs |
| `batch_size` | 16 | Batch size |
| `eval_split` | 0.1 | Validation split |
| `s3_bucket` | - | S3 bucket for output |

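The defaults in this table could be applied to an incoming job payload roughly as follows. This is a sketch under stated assumptions, not the logic in `handler.py`: the names `resolve_params` and `split_sizes` are hypothetical, `s3_bucket` is treated as required because it has no default, and `eval_split` is assumed to be the fraction of samples held out for validation.

```python
# Defaults copied from the parameters table.
DEFAULTS = {
    "task": "ddi",
    "model_name": "Bio_ClinicalBERT",
    "max_samples": 10000,
    "epochs": 3,
    "batch_size": 16,
    "eval_split": 0.1,
}
VALID_TASKS = {"ddi", "ade", "triage", "symptom_disease"}


def resolve_params(job_input):
    """Merge caller-supplied values over the defaults and validate the result."""
    params = {**DEFAULTS, **job_input}
    if params["task"] not in VALID_TASKS:
        raise ValueError(f"unknown task: {params['task']!r}")
    if not params.get("s3_bucket"):
        # Assumption: no default bucket exists, so the caller must supply one.
        raise ValueError("s3_bucket is required")
    for name in ("max_samples", "epochs", "batch_size"):
        if not isinstance(params[name], int) or params[name] <= 0:
            raise ValueError(f"{name} must be a positive integer")
    if not 0 < params["eval_split"] < 1:
        raise ValueError("eval_split must be strictly between 0 and 1")
    return params


def split_sizes(n_samples, eval_split):
    """Return (train, validation) sample counts for a fractional eval_split."""
    n_eval = round(n_samples * eval_split)
    return n_samples - n_eval, n_eval
```

With the defaults, 10000 samples and an `eval_split` of 0.1 give 9000 training and 1000 validation samples.
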
## Development

```bash
# Build container
cd components/runpod_trainer
docker build -t healthcare-trainer .

# Trigger CI build
gh workflow run build-trainer.yaml
```

## License

MIT