MScAC Internship
Toronto / Engineering / Intern / Hybrid
Boson AI is an early-stage startup of 30 scientists. We are building large language model tools for interaction and entertainment. Our founders, Alex Smola and Mu Li, together with a team of Deep Learning, Optimization, NLP, AutoML, and Statistics scientists and engineers, are working on high-quality generative AI models for language and beyond.
We are seeking interns from the University of Toronto to join us in our Toronto office. As part of your role, you will work on modeling and training LLMs, understanding and interpreting model behavior, and aligning models to human values. The ideal candidate has an interest and background in engineering and machine learning, and is motivated to develop state-of-the-art models on the path toward AGI. Some potential research topics include the following:
Data extraction and annotation
Data collection and extraction for LLMs have come a long way. Even small models are now trained on 20+ trillion tokens, a significant fraction of all the text mankind has ever created. Compare that to audio models, which are trained on 10-100 million hours, the equivalent of 20-200 human lifetimes.
Your task is to help design and implement tools for automatic collection of data for a large number of languages (30+) and to build data extraction and annotation pipelines that do not rely heavily on human engineering. That is, you will use unsupervised and semi-supervised techniques to build tools that automatically (and iteratively) refine audio for large audio model training. See e.g. https://arxiv.org/abs/2505.13404 (Nvidia Granary) for a description of some of the more basic components. You will go beyond that by integrating annotation and model training into one iterative loop.
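If it helps to make the loop concrete, below is a minimal sketch of the refine-and-retrain idea, assuming a seed ASR model with hypothetical decode and finetune calls (these are illustrative placeholders, not a real Granary or Boson AI API):

    from dataclasses import dataclass

    @dataclass
    class Clip:
        audio_path: str
        transcript: str = ""
        confidence: float = 0.0

    def transcribe(model, clips):
        # hypothetical: run ASR and attach a per-clip confidence score
        for c in clips:
            c.transcript, c.confidence = model.decode(c.audio_path)
        return clips

    def refine(raw_clips, seed_model, rounds=3, threshold=0.9):
        model = seed_model
        trusted = []
        for _ in range(rounds):
            labeled = transcribe(model, raw_clips)
            # keep only clips the current model is confident about
            trusted = [c for c in labeled if c.confidence >= threshold]
            model = model.finetune(trusted)  # hypothetical training call
            threshold *= 0.98  # gradually admit harder clips as the model improves
        return trusted, model

The point of the loop is that annotation quality and model quality improve together, with no human relabeling step in between rounds.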
This project works for you if you are good at scripting, enjoy experimentation and large amounts of data, and have a moderate understanding of statistics and machine learning.
Audio evaluation and benchmarking
Evaluating large audio models is unique insofar as it relies not only on the correctness of pronunciation but also on a wide range of stylistic elements: cadence, emotion, a matching soundstage, and so on. In other words, it isn’t enough for the voice to sound good. It also needs to fit the context and the intended emotion and style. This requires the design of novel benchmarking tools.
Your task is to help design and implement such algorithms for both evaluating and improving audio models. See e.g. https://arxiv.org/abs/2505.23009 (Boson EmergentTTS-Eval) for a description of some of the challenges. An improved benchmark should be multilingual; it should be able to address and assess both global and local issues in the generated sound; and it should be able to assess conversational audio rather than only monologue generation. We will use the scoring methodology both to evaluate models and (after modification) within the training of the models proper.
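As a sketch of what such a scoring harness could look like, the snippet below rates each clip along several stylistic dimensions with a model-as-judge and aggregates per language; the judge callable and the dimension list are assumptions for illustration, not the EmergentTTS-Eval API:

    import statistics
    from collections import defaultdict

    DIMENSIONS = ["pronunciation", "emotion", "cadence", "context_fit"]

    def score_clip(judge, audio_path, reference_text, lang):
        # the judge returns a 1-5 rating per stylistic dimension
        return {d: judge(audio_path, reference_text, lang, dimension=d)
                for d in DIMENSIONS}

    def benchmark(judge, samples):
        # samples: iterable of (audio_path, reference_text, lang) tuples
        per_lang = defaultdict(lambda: defaultdict(list))
        for audio_path, text, lang in samples:
            for d, rating in score_clip(judge, audio_path, text, lang).items():
                per_lang[lang][d].append(rating)
        # mean and spread per language and dimension expose both global
        # (language-level) and local (dimension-level) weaknesses
        return {lang: {d: (statistics.mean(v), statistics.pstdev(v))
                       for d, v in dims.items()}
                for lang, dims in per_lang.items()}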
This project works for you if you are good at experimentation, ideally speak more than one language, and have a good ear for audio. You should be comfortable with Python and scripting languages for automated evaluation and benchmarking.
Text evaluation and benchmarking
Evaluating modern LLMs presents a critical paradox. While frontier models like GPT-5 show amazing capabilities on a range of tasks, their deployment in high-stakes, real-world applications is often blocked by crucial failures. These include (a) persistent hallucinations, especially when context is complex or imperfectly provided; (b) instability, i.e., the inability to produce the same correct, high-quality result reliably over millions of runs; and (c) difficulty in robustly following a large number of constraints (e.g., 100+ instructions) within a long context (e.g., 20K+ tokens). While mitigations like reasoning or multi-agent workflows might solve a problem in 1 of 100 attempts, real-world applications, especially enterprise ones, cannot tolerate the associated latency or cost; they demand 100/100 accuracy. Probing the maximum capabilities and failure modes of current LLMs is therefore essential for moving the entire field of generative AI forward.
Your task is to help design and implement new evaluation mechanisms and benchmarks that assess LLMs for these critical, deployment-blocking issues. The benchmark should involve humans in the loop, produce verifiable outcomes, and simulate complicated real-world use cases. See e.g. https://arxiv.org/pdf/2406.12045 (tau-bench).
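To make the 100/100 requirement concrete, here is a minimal sketch of a verifier-based stability probe: every constraint is a programmatic check, and a task only counts as passed if all checks hold on every one of N repeated runs. The call_model function and the example constraints are hypothetical:

    CONSTRAINTS = [
        ("mentions_refund_policy", lambda out: "refund" in out.lower()),
        ("under_200_words", lambda out: len(out.split()) <= 200),
        ("no_hedging", lambda out: "probably" not in out.lower()),
    ]

    def stable_pass_rate(call_model, prompt, n_runs=100):
        # a run only passes if every constraint holds
        passes = 0
        for _ in range(n_runs):
            out = call_model(prompt)
            if all(check(out) for _, check in CONSTRAINTS):
                passes += 1
        # enterprise-grade reliability means passes == n_runs, not passes > 0
        return passes / n_runs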
This project works for you if you are good at experimentation, creative, and have strong attention to detail. You should be comfortable building an agentic pipeline and enjoy thinking critically about how and why models fail, not just whether they get an answer right.
Model training and efficient optimization
Efficient model training is one of the most important aspects of building large multimodal models. This includes a wide range of aspects: numerical precision and datatypes (FP4, FP8, BF16, etc.), efficient kernel fusion, managing the different bandwidths of network, memory, and buses, dealing with parallelization bubbles, checkpointing, and much more. Furthermore, new optimization algorithms such as Muon have recently improved performance.
Your task is to help design and implement improved training algorithms. Boson AI has its own datacenter (512 H100s using InfiniBand and 384 A100s using Ethernet) and designs its own models, such as Higgs Audio 2 (https://boson.ai/blog/higgs-audio-v2), which uses custom components on top of a transformer architecture. Improvements in our training loop should increase the numerical efficiency of our solvers (i.e., get closer to peak performance), improve recovery from failures (better checkpointing), improve the optimization algorithms (beyond Muon, https://kellerjordan.github.io/posts/muon/), and build better fused kernels for specific architectures that aren’t entirely default Hugging Face Transformers.
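For a concrete taste of the optimizer side, here is the quintic Newton-Schulz iteration at the heart of Muon, condensed (in PyTorch) from the post linked above; Muon applies it to orthogonalize the momentum matrix of each 2D weight before the update:

    import torch

    def zeropower_via_newtonschulz5(G, steps=5, eps=1e-7):
        # orthogonalize a 2D momentum matrix with a quintic Newton-Schulz iteration
        assert G.ndim == 2
        a, b, c = (3.4445, -4.7750, 2.0315)  # coefficients from the Muon post
        X = G.bfloat16()
        X /= (X.norm() + eps)  # bound the spectral norm by 1
        if G.size(0) > G.size(1):
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            B = b * A + c * A @ A
            X = a * X + B @ X
        if G.size(0) > G.size(1):
            X = X.T
        return X

    update = zeropower_via_newtonschulz5(torch.randn(256, 512))

Note that the whole iteration runs in bfloat16, which is exactly the kind of precision/throughput trade-off this project is about.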
This project works for you if you are interested in systems work and optimization, enjoy writing CUDA/Triton or similar code, and want to gain experience training on a large number of GPUs. Strong coding skills are a necessity.
Model serving and efficient inference
Model serving is a deceptively simple task: at its heart, it requires the system to receive an input, feed it through a model (ideally on a GPU), and return the result of the computation. Unfortunately, things get complicated in practice: for LLMs we want to preserve the KV (key-value) cache wherever possible, maximize GPU utilization, minimize the time to first token, and ensure that models fit (and execute) efficiently on GPUs with limited memory.
Your task is to make Boson AI’s model serving on our own A100 and H100 GPUs (and on other GPUs in the cloud) as efficient as possible, using SGLang (https://github.com/sgl-project/sglang) or vLLM (https://github.com/vllm-project/vllm). On the model optimization side, you will explore inference with reduced-precision models (8-bit, potentially 4-bit quantization), kernel optimization, and memory layout optimization (between GPU and CPU) for efficient KV caching. On the systems side, you will work on ensuring affinity and management of the KV cache when multiple users share the same hardware. Moreover, you will work on dynamically rescaling capacity based on demand.
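As one illustration of the systems side, the sketch below routes requests that share a prompt prefix to the same replica so its KV cache can be reused; this is a toy demonstration of cache affinity, not SGLang’s or vLLM’s actual router:

    import hashlib

    class AffinityRouter:
        def __init__(self, workers, prefix_tokens=64):
            self.workers = workers
            self.prefix_tokens = prefix_tokens

        def pick_worker(self, prompt):
            # hash a fixed-length prefix so conversations sharing a system
            # prompt and history land on the replica that already cached them
            prefix = " ".join(prompt.split()[: self.prefix_tokens])
            h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
            return self.workers[h % len(self.workers)]

    router = AffinityRouter(["gpu-0:30000", "gpu-1:30000"])
    print(router.pick_worker("You are a support agent. User: my order is late"))

A production router would also weigh load and cache hit statistics, which is where dynamic rescaling comes in.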
This project works for you if you are interested in systems work, have some experience with GPU/CPU hardware, have an interest in working with CUDA/Triton, and want to gain experience working with large numbers of servers and GPUs. Strong coding skills are a necessity.
Conversational Dialog Design
While modern LLMs excel at open-ended chat, building reliable, goal-oriented conversational agents remains a complex engineering challenge. This is because real-world applications (like booking, support, or sales) must balance the LLM's natural-language flexibility with the need for robust, auditable, and accurate workflow execution. Developers struggle to (1) define and manage the dialogue's many stages and transitions, (2) enforce "hard rules" and constraints (e.g., policy compliance) without sacrificing conversational smoothness, and (3) manage short- and long-term memory to ensure context is perfectly maintained. This is one of the reasons why systems such as AWS Lex and Google Dialogflow have struggled to gain significant traction.
Your task is to help design and implement a declarative framework that simplifies the creation of complex, goal-oriented agents. The goal is to separate the workflow logic (dialog state manager) from the LLM interaction logic. You will help build a system where developers can define conversational flows, states, transitions, and business rules in a high-level, easy-to-understand language (akin to a 'StateFlow' machine). The LLM will then be constrained to operate within this framework, handling the flexible language understanding and generation while the framework ensures the dialogue path, state, and rules are followed precisely. See e.g. https://arxiv.org/abs/2407.05674 (Genie Worksheets) and https://arxiv.org/abs/2403.11322 (StateFlow).
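A minimal sketch of this separation is shown below: the flow, transitions, and hard rules live in plain data, and the LLM is only asked to fill in the language for the current state. The state names and the llm callable are illustrative assumptions, not a Boson AI product spec:

    # the flow is data: states, legal transitions, and hard rules
    FLOW = {
        "greet":        {"next": ["collect_date"], "rule": None},
        "collect_date": {"next": ["confirm"], "rule": "must obtain a valid date"},
        "confirm":      {"next": ["done"], "rule": "must read back all details"},
        "done":         {"next": [], "rule": None},
    }

    def advance(state, goal_met):
        # the state manager, not the LLM, decides where the dialog goes next
        options = FLOW[state]["next"]
        return options[0] if goal_met and options else state

    def turn(llm, state, user_msg):
        # the LLM is constrained to the current state's goal and rule
        prompt = f"State: {state}. Rule: {FLOW[state]['rule']}. User: {user_msg}"
        reply, goal_met = llm(prompt)  # hypothetical call returning both
        return reply, advance(state, goal_met)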
This project works for you if you are a strong logical and systems-level thinker. It will help if you have backend experience in designing interaction systems. Moreover, expertise in human-computer interaction and dialogue system design is a plus. Equally, it’s great if you enjoy creating stories or building roleplay games.
Reinforcement Learning for Agents
Agents are usually ‘trained’ by talented engineers getting prompts ‘just right’. Needless to say, this is more of an art than a science, and it is difficult to scale and deploy reliably. Moreover, even though we might do this many times for multiple customers, the underlying LLM won’t get any smarter from it. This has led to research on how to improve the LLM directly for the purpose of orchestrating and managing agentic workflows.
Your task is to apply RL to agents, building on recent progress such as Flow-GRPO (https://arxiv.org/abs/2510.05592). This work specifically trains the orchestrating and agentic components of a system (rather than tuning the LLM that handles the front-line dialog). Our goal is to train multiple components jointly while interacting with humans using the system. This will lead to higher-quality results throughout the entire pipeline.
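For intuition, the snippet below shows the group-relative advantage that GRPO-family methods build on: sample a group of rollouts per task, score them, and normalize rewards within the group so that no learned value function is needed. Flow-GRPO applies this style of training inside agentic workflows; the example rewards are made up:

    import statistics

    def group_advantages(rewards, eps=1e-6):
        # normalize each rollout's reward against its own group
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards) + eps
        return [(r - mu) / sigma for r in rewards]

    # four rollouts of the same agentic task, scored 1.0 on success
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # successes get positive advantage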
This project works for you if you are already familiar with the basic components of large language models and have some basic knowledge of reinforcement learning. Moreover, you should enjoy working with real-world data, collecting observations from user traces, and building advanced agentic systems (this is a challenging but exciting project).
Emotions, Theory of Mind and Reinforcement Learning
Much of model alignment and training has been carried out with a fairly neutral stance regarding how a user might react to the model. In reality, we are all different and respond differently to the same stimuli. This requires models to adapt to the needs and desires of individuals. Fortunately, humans provide ample nonverbal cues in addition to the spoken word. At present, this is an entirely underdeveloped area: only comparatively coarse-grained studies from psychology and the social sciences exist (because they require human data collection, scalable studies have so far been impossible).
Your task is to design algorithms that are capable of longer-term dialog planning, e.g. using Monte Carlo Tree Search (https://arxiv.org/abs/2305.13660), and that are capable of modeling the internal state of the user, i.e., a theory of mind (https://www.pnas.org/doi/10.1073/pnas.2405460121). This will lead to agents that are more empathetic and more successful at addressing a user’s needs.
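As a toy illustration, the sketch below applies the UCB selection rule from MCTS one level deep: candidate replies are simulated against a user model and the reply whose simulated futures score best wins. The simulate callable stands in for an LLM-based user simulator, and a full planner would expand this into a tree over multi-turn futures:

    import math, random

    def ucb(total_value, visits, parent_visits, c=1.4):
        if visits == 0:
            return float("inf")  # try every action at least once
        return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

    def plan_reply(candidates, simulate, n_sims=200):
        visits = {a: 0 for a in candidates}
        value = {a: 0.0 for a in candidates}
        for t in range(1, n_sims + 1):
            a = max(candidates, key=lambda x: ucb(value[x], visits[x], t))
            value[a] += simulate(a)  # reward: did the simulated user's need get met?
            visits[a] += 1
        return max(candidates, key=lambda x: visits[x])

    print(plan_reply(["empathize", "ask_detail", "propose_fix"],
                     simulate=lambda a: random.random()))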
This project works for you if you enjoy experimentation, reinforcement learning, and if you have a keen interest in human behavior. Basic Python programming skills are highly desirable.
Face Codec
The next frontier for conversational AI is moving beyond voice-only interaction to embodiment in real-time, realistic human avatars. This project, "Face Codec," aims to create photorealistic "talking heads" that are indistinguishable from real humans, not 3D cartoon models. The primary goal is to develop a lightweight, real-time generation pipeline in which an audio stream drives a neural avatar model. Success requires overcoming three core challenges: (1) Photorealism: generating a high-fidelity, realistic human appearance, including subtle expressions, not just a rendered 3D mesh. (2) Real-time performance: achieving low-latency generation at interactive frame rates (e.g., 30+ fps) to enable two-way conversation. (3) Lip-sync accuracy: ensuring perfect, millisecond-level synchronization between the input audio and the avatar's lip movements, which is critical for user trust and perceived realism.
Your task is to implement the data pipeline and model architecture for audio-driven talking-head synthesis. On the data pipeline side, you will contribute to multiple components, including but not limited to data collection, multi-stage data filtering, and data synthesis. See e.g. https://arxiv.org/abs/2412.03603 (HunyuanVideo) and https://arxiv.org/abs/2503.20314 (Wan audio). On the model architecture side, you will design an efficient system that can incorporate audio, text, or other conditions into model training/inference, as well as a distilled diffusion model that can perform real-time inference. See e.g. https://arxiv.org/abs/2501.00103 (LTX-Video) and https://arxiv.org/abs/2508.18621 (Wan-S2V).
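On the data side, a multi-stage filter might look like the minimal sketch below: each stage is a cheap predicate, and only clips that survive every stage reach training. The thresholds and scores (face detection confidence, SyncNet-style audio-visual sync) are illustrative placeholders, not the pipelines of the cited papers:

    from dataclasses import dataclass

    @dataclass
    class VideoClip:
        path: str
        height: int
        face_confidence: float  # from a face detector
        sync_score: float       # audio-visual sync confidence

    STAGES = [
        ("resolution",  lambda c: c.height >= 720),
        ("single_face", lambda c: c.face_confidence >= 0.9),
        ("lip_sync",    lambda c: c.sync_score >= 0.5),
    ]

    def filter_clips(clips):
        kept = list(clips)
        for name, keep in STAGES:
            kept = [c for c in kept if keep(c)]
            print(f"after {name}: {len(kept)} clips")  # track per-stage yield
        return kept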
This project works for you if you have a strong background in image/video generation, especially GANs, VAEs, diffusion models, NeRFs, or graphics. You should be comfortable with large-scale data collection, filtering, and model optimization.
$70,000 - $100,000 one-time
Compensation is in line with the MScAC guidelines. We encourage students to apply to Mitacs for additional resources.
Please apply if you want to work on exciting topics in generative AI, artificial intelligence and advanced statistics.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
