ML Framework Engineer

Las Vegas, Nevada
Engineering / Full Time / Hybrid

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

Job Description:
TensorWave is seeking an ML Framework Engineer to lead the integration, optimization, and maintenance of PyTorch (and select AI libraries) on AMD ROCm GPUs. This role is critical in ensuring our AI cloud platform remains at the cutting edge of performance, stability, and compatibility by tracking upstream framework changes, debugging compatibility issues, and automating builds, testing, and benchmarking. You will be responsible for maintaining a registry of validated AI libraries, debugging low-level performance issues, and working with external maintainers to upstream fixes. You will collaborate with DevOps, MLOps, and AI researchers to ensure a seamless deployment and development experience across TensorWave’s infrastructure. This role is ideal for an engineer with deep PyTorch internals knowledge, strong GPU debugging experience, and a passion for optimizing AI workloads at the framework level.

Responsibilities

    • Framework Compatibility & Versioning: Track PyTorch and other AI framework updates, maintain a versioned registry of validated builds, and proactively handle breaking changes.
    • Kernel Debugging & Profiling: Triage and debug ROCm-related issues affecting AI workloads, handling small fixes directly and escalating complex issues to MLOps and third-party maintainers.
    • Build & CI/CD Automation: Develop and maintain automated build pipelines for AI frameworks, integrating regression testing and benchmarking, while working with DevOps for large-scale automation.
    • Performance Optimization: Profile and analyze AI workload performance on AMD GPUs, identifying bottlenecks in memory access, kernel execution, and framework overhead.
    • Third-Party Collaboration: Work with PyTorch maintainers, ROCm engineers, and external AI library contributors to improve framework compatibility and push upstream fixes when needed.
    • Container & Environment Management: Maintain and update prebuilt AI container environments, ensuring seamless integration with TensorWave’s inference and training infrastructure.
    • Documentation & Knowledge Sharing: Serve as the subject matter expert (SME) for library compatibility, maintaining internal documentation on framework versions, known issues, and best practices.

Essential Skills & Qualifications

    • 3+ years of experience in ML framework development, optimization, or GPU debugging.
    • Strong expertise in PyTorch internals, model execution, and AI framework architecture.
    • Experience with ROCm or CUDA development, including kernel debugging and profiling.
    • Proficiency in Python and C++, with experience in optimizing AI workloads at the framework level.
    • Familiarity with low-level GPU performance profiling tools (rocprof, Nsight, perf, VTune, etc.).
    • Hands-on experience with CI/CD for AI frameworks, including automated testing and benchmarking.
    • Strong understanding of containerization (Docker, Kubernetes) and dependency management (pip, Conda, Bazel, CMake, etc.).
    • Excellent documentation skills, with a focus on library versioning, compatibility tracking, and regression analysis.

Preferred Qualifications

    • Experience contributing to PyTorch or other open-source ML frameworks.
    • Prior experience maintaining a private pip or Conda package registry for AI software.
    • Familiarity with distributed training, model parallelism, and mixed precision training.
    • Knowledge of LLM-specific optimizations, such as quantization and tensor parallel execution.
    • Exposure to high-performance computing (HPC) environments for AI workloads.

We’re looking for resilient, adaptable people to join our team: people who enjoy collaborating and tackling tough challenges. We offer real opportunities for growth, letting you dive into complex problems and make a meaningful impact through creative solutions. If you’re a driven contributor, we encourage you to explore opportunities at TensorWave. Join us as we redefine the possibilities of intelligent computing.

What We Bring:
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

    • Stock Options
    • 100% paid Medical, Dental, and Vision insurance
    • Life and Voluntary Supplemental Insurance
    • Short Term Disability Insurance
    • Flexible Spending Account
    • 401(k)
    • Flexible PTO
    • Paid Holidays
    • Parental Leave
    • Mental Health Benefits through Spring Health