Senior Storage System Engineer - Supercomputing

Sunnyvale, CA
Engineering /
Full-time /
On-site
About the Institute of Foundation Models
We are a dedicated research lab focused on building, understanding, using, and managing the risks of foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you'll have the opportunity to work at the core of cutting-edge foundation model training alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions with the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Storage Systems Engineer on the IFM Supercomputing Team, you will design, build, and optimize high-performance storage systems to support some of the most advanced GPU supercomputing clusters in academia. These clusters power both AI training and inference workloads, requiring exceptional reliability, scalability, and low-latency data access.

Job Responsibilities

    • Architect and implement distributed and parallel file systems (e.g., Lustre, DDN, VAST) optimized for large-scale AI and HPC workloads.
    • Ensure seamless integration of storage with compute clusters managed by Slurm, Kubernetes, and other orchestration systems.
    • Optimize I/O performance for high-throughput, low-latency access using modern storage technologies (NVMe, SSD) and parallel file systems.
    • Collaborate with infrastructure teams to enhance deployment pipelines using Infrastructure-as-Code (IaC) tools, ensuring reproducibility and reliability.
    • Monitor and maintain storage systems across on-premise and hybrid environments, proactively addressing performance bottlenecks and system failures.
    • Contribute to capacity planning, fault tolerance, and data durability strategies aligned with IFM’s growing computational demands.

Tech Stack

    • Lustre or similar parallel file systems.
    • Ceph, ZFS, MinIO, S3, GCS, or similar distributed storage systems.
    • Slurm, Kubernetes, or similar schedulers.
    • Pulumi, Terraform, Ansible.
    • NVMe, SSD, HDD technologies.

Professional Experience

    • Proven experience designing and operating large-scale distributed or parallel storage systems (e.g., Lustre, DDN, VAST, Ceph, ZFS) in HPC or AI environments.
    • Strong familiarity with storage hardware (NVMe, SSD, HDD) and performance tuning in high-throughput, compute-intensive clusters.
    • Experience working with Slurm and Kubernetes workload managers in production HPC environments.
    • Track record of working in large-scale supercomputing environments, ideally at national labs (e.g., LLNL, CSCS), top universities (e.g., Stanford), major tech firms (e.g., xAI, Meta, AWS), or enterprise vendors (e.g., NVIDIA, HPE, DDN).
    • Proficiency in developing storage-related tooling or monitoring solutions using Go or Rust.
    • Experience managing storage infrastructure via Infrastructure-as-Code (e.g., Terraform, Pulumi, Ansible).
    • Bonus: Familiarity with AI/ML data workflows and large-scale dataset handling.
$200,000 - $400,000 a year
Salary depends on level.
Visa Sponsorship
This position is eligible for visa sponsorship.

Benefits Include
*Comprehensive medical, dental, and vision benefits
*Bonus
*401(k) plan
*Generous paid time off, sick leave, and holidays
*Paid parental leave
*Employee Assistance Program
*Life and disability insurance