Senior Software Engineer, Reliability Engineering (US)

Sunnyvale, CA or Seattle, WA

Engineering / Full-time / Hybrid

About Onehouse

Onehouse is a mission-driven company dedicated to freeing data from data platform lock-in. We deliver the industry’s most interoperable data lakehouse through a cloud-native managed service built on Apache Hudi. Onehouse enables organizations to ingest data at scale with minute-level freshness, centrally store it, and make it available to any downstream query engine and use case (from traditional analytics to real-time AI / ML).

We are a team of self-driven, inspired, and seasoned builders who have created large-scale data systems and globally distributed platforms that sit at the heart of some of the largest enterprises, including Uber, Snowflake, AWS, LinkedIn, Confluent and many more. Riding off a fresh $35M Series B backed by Craft, Greylock and Addition Ventures, we're now at $68M total funding and looking for rising talent to grow with us and become future leaders of the team. Come help us build the world's best fully managed and self-optimizing data lake platform!

The Community You Will Join

When you join Onehouse, you're joining a team of passionate professionals tackling the deeply technical challenges of building a two-sided engineering product. Our engineering team serves as the bridge between the worlds of open source and enterprise: contributing directly to and growing Apache Hudi (already used at scale by global enterprises like Uber, Amazon, ByteDance, etc.) and concurrently defining a new industry category - the transactional data lake. The Reliability Engineering team is the glue that binds all of this together. You will be responsible for developing and maintaining the tools and systems that enable our engineering teams to operate our services reliably and at scale. You will partner closely and cross-functionally with our engineering teams to ensure our services are able to scale with our growing business.

The Impact You Will Drive:

At Onehouse, you will own our entire live production infrastructure and operational posture to run massive data systems at scale.
Ensure our services remain resilient by identifying opportunities for improvement and driving their implementation.
Identify opportunities to improve our overall operational efficiency as we grow by owning the modern tools in our cloud-only operation and our practices for proactive automation, monitoring and response.
Act as a mentor to guide cross-functional teams during crisis situations and ensure timely resolution, minimizing the impact on our customers and business.

A Typical Day:

Build and own our reliability engineering practice from the ground up, owning our entire production infrastructure and operational posture.
Establish a culture of reliability across engineering by providing a comprehensive incident management platform that is used for instrumentation, operability, and incident response.
Design, implement and maintain new services, tools, and monitoring to support service reliability and alerting.
Serve as an active member of our SRE team, responding to and managing high severity incidents or any situations concerning the wellbeing and continuous operation of our mission-critical systems.
Collaborate with your stakeholders across engineering teams to ensure continuous adoption of best practices and sound rollout scenarios, and that services are designed with reliability in mind.
Continuously analyze and evaluate the tradeoffs of the existing designs and make recommendations based on new technologies and industry best practices.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health through an intimate understanding of how the critical parts of our site work.
Contribute to better incident management posture and retrospectives, driving improvements in our overall reliability and incident response time as well as on-call runbooks and post-mortem reports.
Drive our compliance posture, ensuring that all our products and processes comply with relevant regulations and standards, especially during compliance audits.

What You Bring to the Table:

Bachelor's degree in Computer Science or related field.
7+ years of experience in software engineering or SRE roles, with a focus on large-scale distributed systems.
Strong coding skills in at least one programming language, such as Java, Python, or Go.
Strong conviction in software development best practices, including version control, automated testing, and continuous integration and delivery.
Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
Experience managing Kubernetes clusters and applications at scale.
Experience deploying applications on one or more cloud platforms such as AWS, Google Cloud Platform or Microsoft Azure.
Experience defining and owning reliability-focused systems and processes (e.g. Incident Management, Post-mortem).
Experience with software development related compliance processes (e.g. SOC 2, FedRAMP).
Experience with the following tech stack:
Infrastructure-as-code (e.g. Terraform, CloudFormation)
Automation frameworks (e.g. Jenkins, CircleCI)
Monitoring stacks (e.g. Prometheus and ELK)
Cloud security management (e.g. IAM, SSO)
Data processing technologies like Spark

How We'll Take Care of You

-Competitive Compensation; the estimated base salary range for this role is $150,000 - $220,000

-Equity Compensation; our success is your success with eligible participation in our company equity plan

-Health & Well-being; we'll invest in your physical and mental well-being with up to 90% health coverage (50% for spouses/dependents) including comprehensive medical, dental & vision benefits

-Financial Future; we'll invest in your financial well-being by making this role eligible to contribute to our company 401(k) or Roth 401(k) retirement plan

-Location; we are a remote-friendly company (internationally distributed across N. America + India), though some roles will be subject to in-person requirements in alignment with the needs of the business

-Generous Time Off; unlimited PTO (mandatory 1 week/year minimum), uncapped sick days and 11 paid company holidays

-Company Camaraderie; annual company offsites and quarterly team onsites at our Sunnyvale HQ

-Food & Meal Allowance; weekly lunch stipend, in-office snacks/drinks

-Equipment; we'll provide you with the equipment you need to be successful and a one-time $500 stipend for your initial desk setup

-Child Bonding; 8 weeks off for parents (birthing, non-birthing, adoptive, foster, child placement, new guardianship) - fully paid so you can focus your energy on your newest addition

House Values

One Team

Optimize for the company, your team, self - in that order. We may fight long and hard in the trenches, but we take care of our co-workers with empathy. We give more than we take to build the one house that everyone dreams of being part of.

Tough & Persevering

We are building our company in a very large, fast-growing but highly competitive space. Life will get tough sometimes. We take hardships in stride, stay positive, focus all energy on the path forward, and develop a champion's mindset to overcome the odds. Always day one!

Keep Making It Better Always

Rome was not built in a day; if we can get 1% better each day for one year, we'll end up thirty-seven times better. This means being organized, communicating promptly, taking even small tasks seriously, tracking all small ideas, and paying it forward.

Think Big, Act Fast

We have tremendous scope for innovation, but we will still be judged by impact over time. Big, bold ideas still need to be strategized against priorities, broken down, set in rapid motion, measured, refined, and repeated. Great execution is what separates promising companies from proven unicorns.

Be Customer Obsessed

Everyone has the responsibility to drive towards the best experience for the customer, be it an OSS user or a paid customer. If something is broken, own it, say something, do something; never ignore it. Be the change that you want to see in the company.

Pay Range Transparency

Onehouse is committed to fair and equitable compensation practices. Our job titles may span more than one career level. The pay range for this role is listed above and represents the base salary range for non-commissionable roles or on-target earnings for commissionable roles. Actual compensation packages depend on several factors that are unique to each candidate, including but not limited to: job-related skills, depth of transferable experience, relevant certifications and training, business needs, market demands and specific work location. Based on the factors above, Onehouse utilizes the full width of the range; the base pay range is subject to change and may be modified in the future. The total compensation package for this position will also include eligibility for equity options and the benefits listed above.
