Genomics Data Engineer

Oakland
Engineering
Full-time
You will work at the interface of genomics, big data engineering, and advanced analytics. You will contribute to the expansion of our Apache Spark-based distributed analytics platform, building production-quality data processing infrastructure and developing scalable algorithms to analyze genomic and health data for the diagnosis of rare genetic diseases. This role encompasses engineering end-to-end solutions that unify and structure diverse data sets efficiently and implement AI algorithms at scale to derive genomic insights in support of clinical applications.

The ideal candidate will have a strong background in computer science, data mining, machine learning, or a related field, with demonstrated experience in engineering scalable and performant data processing software in Spark or another distributed compute environment. In addition, previous experience in a life sciences domain or biotech is essential. 

Responsibilities

    • Build out a big data distributed architecture capable of efficiently processing large-scale genomics data
    • Develop and deploy bioinformatics/AI analysis algorithms at scale
    • Build automated and production-quality data processing systems
    • Interact and collaborate with scientists to clearly define and iterate on requirements
    • Keep abreast of state-of-the-art data engineering and data science technologies
    • Aggregate and analyze genomic and other types of clinical data to find novel insights
    • Develop code to implement analysis workflows in a robust and reproducible fashion
    • Follow processes that improve the transparency and reliability of applications, reducing project risk so that milestones are met on time
    • Educate other scientists, engineers, and management on the methods developed and how they apply to the subject domain and customer needs

Qualifications

    • This position requires a B.S. (M.S. or Ph.D. preferred) with 3+ years of experience in computer science, specializing in high-performance/distributed computing, data mining, machine learning, or bioinformatics.
    • 3+ years of software engineering experience with proficiency in Python, Scala, Java, or C/C++.
    • Knowledge of distributed compute technologies, such as Spark, Hadoop, MapReduce, MPI, or other parallel computing frameworks, is essential.
    • Strong foundation in data engineering, data science, and/or machine learning, with demonstrated experience applying these technologies at scale on real-world data sets.
    • Knowledge of database technologies, indexing/partitioning, and SQL.
    • Experience with cloud computing (AWS preferred).
    • Experience engineering high-volume data and scientific dataflows.
    • Background in bioinformatics, life sciences, genomics, or biology is highly preferred.
    • Team player with excellent communication skills to effectively collaborate with multiple cross-functional teams of scientists, clinicians, and engineers.

Candidates must have pre-existing US work authorization.