Member of Technical Staff, Data Pipeline

Santa Clara HQ
Engineering / Full-time / On-site
Boson AI is an early-stage startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML, and Statistics scientists and engineers are working on high-quality generative AI models for language and beyond.

We are seeking research scientists and engineers to join our team full-time in our Santa Clara office. In this role, you will implement and train deep neural networks, understand and interpret model behavior, and align models to human values. The ideal candidate has a strong background in machine learning and is motivated to develop state-of-the-art models toward AGI.

We encourage you to apply even if you do not believe you meet every single qualification. As long as you are motivated to learn and join the development of foundation models, we’d love to chat.

Responsibilities:

    • Design and develop data collection pipelines to gather and preprocess diverse datasets from various sources.
    • Design and develop data processing pipelines, including data labeling, filtering, cleaning, visualization, and auditing.
    • Implement machine learning models to improve the quality and diversity of data.

You may be a good fit if you have:

    • Strong proficiency in building large-scale data processing pipelines, including familiarity with distributed workloads (e.g., multiprocessing).
    • Proficiency in at least one programming language commonly used in machine learning, such as Python, and the ability to write clean, maintainable code.
    • Proficiency in at least one deep learning framework, such as PyTorch.
    • A bachelor's degree in computer science or equivalent.
    • Excellent problem-solving skills and attention to detail, especially when handling data anomalies and biases to improve data quality.

Strong candidates may also have:

    • Familiarity with at least one tool for data labeling (e.g., LabelStudio), data collection (e.g., VPN, Selenium), or data processing (e.g., Hadoop, Datasketch).
    • Experience in building large-scale datasets.
    • Hands-on experience with cloud platforms such as AWS, Azure, or GCP.
    • Experience in machine learning, e.g., projects in language, vision, or audio.
    • Active GitHub contributions (a big plus).
    • Multilingual ability, which enriches the language diversity crucial for robust model training.
    • Experience with fairness, toxicity, data privacy regulations, and compliance considerations.