High Performance Computing (HPC) Engineer
Palo Alto, CA
Engineering / Full Time / On-site
Submit your application
Resume/CV ✱
Full name ✱
Email ✱
Phone ✱
Current location ✱
Current company ✱
Links
LinkedIn URL ✱
GitHub URL
Portfolio URL
Twitter URL
Genbio
Work Permit
Are you authorized to work in the country where this position is based?
Yes
No
If you do not have work authorization, will you require sponsorship?
Yes
No
Relocation
Are you willing to relocate if you do not live close to the office for this position?
Yes
No
How soon can you relocate to the office area?
Within 1 month after the offer is signed
1-3 months after the offer is signed
3-6 months after the offer is signed
Short Answer
Briefly describe the most relevant project you have worked on. Be sure to outline your specific contributions.
HPC
What is your experience level with managing and optimizing GPU clusters for large-scale ML workloads?
Advanced – I have independently deployed and managed GPU clusters, including installation, resource scheduling (e.g., SLURM), monitoring, and performance tuning for distributed ML workloads.
Intermediate – I’ve helped configure or maintain GPU clusters and can monitor and troubleshoot jobs, but haven’t independently built or optimized clusters.
Beginner – I’ve run jobs on existing clusters (e.g., via SLURM or cloud platforms) but have not configured or managed them.
No experience – I have not worked with GPU clusters.
What is your experience level with distributed deep learning and parallel training of large models (e.g., with PyTorch, DeepSpeed, Megatron-LM)?
Advanced – I have implemented distributed training pipelines across multiple nodes/GPUs using frameworks like DeepSpeed, FSDP, or Megatron-LM, and have tuned synchronization strategies and batch scheduling for scale.
Intermediate – I’ve run or adapted distributed training scripts using tools like PyTorch DDP or HuggingFace Accelerate but didn’t build or optimize them myself.
Beginner – I’ve used standard training scripts or single-GPU setups, but have not worked on parallel or multi-node training.
No experience – I have not worked on distributed model training.
What is your experience level with resource scheduling and containerized orchestration (e.g., SLURM, Kubernetes) for ML or HPC environments?
Advanced – I have designed and managed resource scheduling or orchestration workflows using SLURM, Kubernetes, or similar for large-scale ML/HPC workloads, including autoscaling and multi-user environments.
Intermediate – I have worked with job schedulers or Kubernetes in preconfigured environments but did not configure them myself.
Beginner – I have basic familiarity or followed tutorials using schedulers or containers for small-scale experiments.
No experience – I have not worked with scheduling or container orchestration systems.
Additional information