CUDA Kernels Engineer
Palo Alto, CA
Engineering /
Full Time /
On-site
Submit your application
Resume/CV
✱
Full name
✱
Email
✱
Phone
✱
Current location
✱
Current company
✱
Links
LinkedIn URL
✱
GitHub URL
Portfolio URL
Twitter URL
Genbio
Work Permit
Are you authorized to work in the country where this position is based?
Yes
No
If you don't have work authorization, will you require work authorization sponsorship?
Yes
No
Relocation
Are you willing to relocate if you do not live close to the local office?
Yes
No
How soon can you move to the area of the local office?
Within 1 month after the offer is signed
1-3 months after the offer is signed
3-6 months after the offer is signed
Short Answer
Briefly describe the most relevant project you have worked on. Be sure to outline your specific contributions.
Performance Engineering
What is your experience level with writing and optimizing GPU kernels using CUDA or similar low-level programming frameworks (e.g., Triton, OpenCL)?
Advanced – I have independently written and optimized custom CUDA or Triton kernels for performance-critical applications, and understand warp-level programming, memory hierarchy, and performance profiling.
Intermediate – I have written or modified CUDA/Triton kernels and used them in ML or HPC workflows, but optimization and debugging were supported by others.
Beginner – I’ve experimented with CUDA or similar frameworks in tutorials or coursework but haven’t deployed anything in a real system.
No experience – I have never written or optimized GPU kernels.
What is your experience level with AI accelerators or GPU/CPU hardware architecture and performance optimization?
Advanced – I deeply understand GPU/CPU architecture (e.g., memory bandwidth, SIMD, registers, cache), and have optimized software to maximize hardware utilization in ML or HPC workloads.
Intermediate – I have a solid understanding of GPU or accelerator performance characteristics and have used tools like Nsight, perf, or VTune for optimization guidance.
Beginner – I’m familiar with general GPU concepts and performance tuning ideas but haven’t optimized software for specific hardware architectures.
No experience – I haven’t worked on performance optimization with awareness of hardware internals.
What is your experience level with foundation model architectures and training infrastructure (e.g., Transformers, LLMs)?
Advanced – I have worked closely with training infrastructure and optimization for LLMs or other foundation models and understand architecture-level tradeoffs that affect training efficiency.
Intermediate – I’ve worked with Transformer-based models or training pipelines but not in a performance or systems optimization role.
Beginner – I have implemented or fine-tuned Transformer models but haven’t explored the system-level or architectural aspects.
No experience – I haven’t worked with LLMs or foundation models.
Additional information
Submit application