Staff Platform Engineer - High Performance Computing Infrastructure Platform Management

Singapore, Singapore
Cloud Infrastructure & Services / Full-time / On-site
You will be part of a dynamic team responsible for building resilient network infrastructure using cutting-edge technologies such as cloud-based and software-defined networking (for example, SD-WAN, ACI and NSX). You must have a good understanding of IT infrastructure systems and knowledge of the latest networking technologies and platforms. You will be a technical specialist in the team, and must be keen to take on new challenges and keep abreast of the rapidly evolving technology landscape.

Role

    • We are seeking an experienced HPC Staff Engineer to join our team, responsible for managing and optimizing our HPC infrastructure platform. The successful candidate will have a deep understanding of HPC systems, architectures and technologies, as well as experience with managing large-scale computing environments. The role will involve designing, implementing and maintaining the HPC infrastructure platform, ensuring high availability, scalability and performance.

Responsibilities

    • Lead a team to deliver a resilient, scalable and secure HPC platform, including compute nodes, storage systems, networks and job scheduling systems.
    • Lead, design, implement and manage the HPC infrastructure platform to meet organisational needs.
    • Design and implement storage solutions for HPC workloads to ensure efficient data storage and retrieval.
    • Design and implement high-performance networking solutions, including InfiniBand, Ethernet, and other interconnects.
    • Plan and manage HPC resource capacity, including forecasting, procurement and deployment of new hardware and software.
    • Manage HPC clusters, including optimizing, monitoring and troubleshooting cluster performance, as well as managing job scheduling and resource allocation. 
    • Ensure the security and compliance of the HPC infrastructure platform, including managing access controls, implementing security patches, and conducting regular security checks.
    • Collaborate with stakeholders such as data scientists and developers to optimize application performance on the HPC platform, and provide technical support to users of the HPC infrastructure platform.

Requirements (Minimum Qualifications)

    • Bachelor's degree in Computer Science, Computer Engineering, or a related field.
    • 8+ years of experience in managing HPC systems, including experience with Linux, Unix, or other operating systems.
    • Strong knowledge of HPC architectures, including clusters, grids, and clouds.
    • Experience with HPC job scheduling systems, such as Slurm, Torque and LSF.
    • Strong understanding of storage systems, including SANs, NAS, and object storage.
    • Experience with high-performance networking, including InfiniBand, Ethernet, and other interconnects.
    • Experience with cloud computing platforms, such as AWS, Azure, or Google Cloud.
    • Experience with scripting languages, such as Python, Perl, or Bash.
    • Experience with containerization (Docker, Kubernetes), proficiency in a range of complementary technologies, including Knative, Run:AI, Grafana, Prometheus, Kyverno, ArgoCD, Rancher and NVIDIA BCM, and knowledge of the NVIDIA SuperPOD architecture.
    • Experience in leading engineering teams.

Nice to Have

    • Certifications in NVIDIA AI Infrastructure and Operations, and as a Certified Kubernetes Administrator (CKA).
    • Experience with machine learning or deep learning frameworks, such as TensorFlow or PyTorch.
    • Familiarity with agile development methodologies and version control systems, such as Git.

Why join us?

    • The work is purposeful and meaningful 
    • You will work with the best engineers 
    • We work with modern technologies and tech stacks 
    • We have excellent engineering culture and work-life balance 
    • We aspire to engineering and operational excellence 
    • We empower you to innovate
    • We grow together as a family 
As CSIT is an agency under the Ministry of Defence (Singapore), only Singapore Citizens will be considered.