Software Engineer - ML Infrastructure

Mountain View, CA
Engineering / Full Time / On-site

In this role, you will help design and build the core of our system, working alongside some of the best researchers from Stanford and other top universities. We are adapting and inventing cutting-edge ML techniques for learning over data warehouses. The technical challenge is to create scalable storage and compute systems that integrate with both the data warehouse and the ML training and inference systems, and that provide the scalability, reliability, restartability, and security that cloud-first applications require. The final product will follow a customer-first approach and be extremely easy to use for both ML and cloud deployment. You will work alongside seasoned engineering leaders with decades of experience.

The Value You'll Add:

    • Design the core of our training and inference systems
    • Design APIs between the system components to decouple them and make independent development easy
    • Produce designs that can be iterated on over time to achieve greater scalability
    • Combine strong software engineering principles with machine learning to build scalable, reproducible and easy-to-use end-to-end machine learning workflows for advanced deep learning problems
    • Build backend infrastructure to perform scalable training, evaluation, and inference in the cloud
    • Build comprehensive data management systems for scalable data collection, labeling, processing, and evaluation
    • Collaborate with product teams and engineers to make applications of machine learning ubiquitous 
    • Apply knowledge of GPU programming frameworks such as OpenCL or CUDA
    • Increase PyG and GNN efficiency through GPU programming

Your Foundation:

    • BA/BS degree in Computer Science or a related technical discipline, or equivalent practical experience
    • 2+ years of experience in software design, development, and algorithm-related solutions
    • 2+ years of experience programming in object-oriented languages such as Python or C++
    • Experience in CUDA programming and other GPU optimizations
    • Experience with PyTorch, TensorFlow, MXNet, ONNX, etc.
    • Ability to collaborate and work well with others
    • Proven track record of operating highly-available systems at significant scale
    • Ability to proactively learn new concepts and apply them at work

Your Extra Special Sauce:

    • Knowledge of cloud distributed storage, databases, and file systems
    • Experience building microservices and cloud platforms with AWS, Azure, NoSQL solutions, Memcached/Redis, Kafka, and Kubernetes
    • Experience with industry projects, open-source projects, and/or academic research in large-scale data, parallel, and distributed systems
    • Good knowledge of distributed data processing technologies
    • Experience using and/or contributing to an inference-serving technology, PyTorch, TensorFlow, etc.
    • Experience in building machine learning platforms at large scale
    • Understanding of the fundamentals of machine learning, ideally from both academic and industry environments
    • Experience working on online ranking/recommendation systems
    • Experience building large-scale production machine learning systems or data pipelines
    • Experience with backend services or distributed systems
    • Experience with infrastructure and large-scale system design
    • Experience building large-scale production data pipelines
    • Familiarity with machine learning frameworks such as TensorFlow, Caffe2, PyTorch, Spark ML, scikit-learn, or related frameworks

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.