Founding Data Infrastructure Engineer (Query Engine)

San Mateo, CA
Infrastructure / Full Time / On-site
About Us
We are on a mission to bridge the gap between enterprise business knowledge and data, democratizing data discovery and curation to prepare organizations for the era of generative AI. Today's data tools are overly complex, poorly integrated, and siloed, forcing AI practitioners and data scientists alike to spend more time wrestling with tools, relying on tribal knowledge, and navigating data lakes than doing meaningful data science work. The current landscape of data tools and processes is heavily manual and has not kept pace with the vast amount of data generated daily. With the advent of generative AI and multimodality, this challenge has only grown more complex.

Backed by top VC funds, we are committed to making enterprise data AI-ready faster, more reliably, and with a stronger foundation of factual semantic knowledge. This leads to more accurate models, superior outcomes, and better business results. Our team of seasoned data infrastructure and machine learning experts (from LinkedIn, Visa, Truera, Hive, and Branch) has spent the past two decades building bespoke systems to solve these very challenges.

Join our growing team of ML research and data infrastructure experts. We're committed to empowering AI practitioners and data scientists to seamlessly integrate semantic learning with generative AI. Be part of our journey to shape the future of enterprise AI.

Who You Are

    • You thrive in early-stage environments and are eager to build robust systems from scratch.
    • You are passionate about distributed systems and solving complex data challenges at scale.
    • You can navigate ambiguity and adapt to changing requirements in a fast-paced startup.
    • You advocate for engineering efficiency and continuous improvement.
    • You are a leader who enjoys mentoring others and fostering a strong engineering culture.
    • You are excited to work cross-functionally with a team that values transparency, purpose-driven innovation, and collective leadership.

What You Will Be Doing

    • Design and Develop: Build scalable, fault-tolerant query engines optimized for performance and resource efficiency.
    • Optimize Performance: Apply advanced techniques such as vectorized processing, cost-based optimization, and caching to enhance query execution.
    • Integrate Seamlessly: Develop integrations with modern data lake table formats (e.g., Apache Iceberg, Delta Lake, and Apache Hudi) and semantic layers.
    • Innovate in Distributed Systems: Architect solutions to handle concurrency, scalability, and reliability in distributed environments.
    • Leverage Open Source: Contribute to or extend platforms like Apache Spark, Presto, and Trino to meet unique product requirements.
    • Collaborate: Work with product, data science, and engineering teams to align technical solutions with business needs.
    • Stay Current: Research and implement the latest advancements in query processing and distributed systems.

Prior Experience

    • Proficiency in programming languages such as Java, Scala, Rust, or C++.
    • Deep understanding of query engine internals, distributed systems architecture, and parallel query processing.
    • Experience with modern big data technologies (e.g., Apache Spark, Presto, Trino) and data formats like Parquet, ORC, or Avro.
    • Proven ability to build and optimize scalable systems capable of processing petabyte-scale datasets.
    • Strong grasp of SQL semantics, execution plans, and query optimization techniques.

Nice to Have

    • Experience in building AI/ML infrastructure and ML production systems at scale.
    • Hands-on experience with Linux and with containerization technologies such as Docker.
    • Prior experience at an early-stage startup, developing systems and processes from scratch.