Genomics Data Engineer
You will work at the interface of genomics, big data engineering, and advanced analytics. You will contribute to the expansion of our Apache Spark-based distributed analytics platform, building production-quality data processing infrastructure and developing scalable algorithms to analyze genomic and health data for diagnostics of rare genetic disease. This role encompasses engineering of end-to-end solutions that unify and structure diverse data sets efficiently and implement AI algorithms at scale to derive genomic insights in support of clinical applications.
The ideal candidate will have a strong background in computer science, data mining, machine learning, or a related field, with demonstrated experience in engineering scalable and performant data processing software in Spark or another distributed compute environment. In addition, previous experience in a life sciences domain or biotech is essential.
- Build out a big data distributed architecture capable of efficiently processing large-scale genomics data.
- Develop and deploy bioinformatics/AI analysis algorithms at scale.
- Build automated and production-quality data processing systems.
- Interact and collaborate with scientists to clearly define and iterate on requirements.
- Keep abreast of new state-of-the-art software data engineering and data science technologies.
- Aggregate and analyze genomic and other types of clinical data to find novel insights.
- Develop code to implement analysis workflows in a robust and reproducible fashion.
- Follow processes that improve the transparency and reliability of applications, reducing project risk so that milestones are met on time.
- Educate other scientists, engineers and management on the methods developed and how they apply to the subject domain and customer needs.
- This position requires a B.S. (M.S. or Ph.D. preferred) with 3+ years of experience in computer science, specializing in high-performance/distributed computing, data mining, machine learning, or bioinformatics.
- 3+ years of software engineering experience with proficiency in Python, Scala, Java, or C/C++.
- Knowledge of distributed compute technologies, such as Spark, Hadoop, MapReduce, MPI, or other parallel computing frameworks, is essential.
- Strong foundation in data engineering, data science, and/or machine learning, with demonstrated experience applying these technologies at scale on real-world data sets.
- Knowledge of database technologies, indexing/partitioning, and SQL.
- Experience with cloud computing (AWS preferred).
- Experience engineering high-volume data and scientific dataflows.
- Background in bioinformatics, life sciences, genomics, or biology is highly preferred.
- Team player with excellent communication skills to effectively collaborate with multiple cross-functional teams of scientists, clinicians, and engineers.
Candidates must have pre-existing US work authorization.