Staff Platform Engineer

Bengaluru, India
Engineering – DevOps / Full time / On-site
At Hevo, we are changing the way companies leverage data to drive user experience, growth, and business processes. 
There has been a fundamental change in the amount of data companies are generating on a day-to-day basis. More and more users in an organisation are now looking to use data to drive business decisions. Data is no longer a second-class citizen, and companies are seeing data as a competitive advantage. We see this change, and we are on a mission to change the way companies leverage their data. 
With a technology platform processing more than 100 billion records a month and doubling every 6 months, Hevo is poised for exponential growth in the coming years. This position provides a unique opportunity to create a massive impact on all existing and future customers of Hevo through technology innovation.
We are looking for people who believe in challenging the status quo and are ready to be a part of this change. If you are ready to take a leap of faith and work on the technology of the future, and if you obsess over customer satisfaction and experience, then we are looking for you.


What We Do

🚀 We implement high throughput data pipelines using Kafka and Java.
💖 We build the world's prettiest and most intuitive user interfaces using React, Angular, TypeScript, and other OSS libraries.
💪 We use a variety of other Open Source technologies, including MySQL, Redis, InfluxDB, and more.
✨ We write reusable, efficient, and highly concurrent code. We are proud of the technology we build, but we are not dogmatic about our techniques. 
💭 We frequently re-evaluate our decisions and proactively make improvements to avoid last-minute chaos.

What You’ll Be Doing

Maintain the Hevo platform with a steadfast commitment to achieving 99.99% uptime, overseeing a network of over 10,000 pipelines distributed across diverse regions and cloud environments.

Prioritize the stability of pipelines as the foremost objective, serving as the guardian of production readiness. In the event of incidents, respond swiftly to resolve them, ensuring minimal disruption to our valued customers.

Develop highly scalable tools and services that proactively detect and address production issues, and create comprehensive Standard Operating Procedures (SOPs) for SysOps teams to follow. Dedicate 50% of your time and effort to coding initiatives.

Demonstrate unwavering ownership and accountability regarding timelines, system uptime, and production Service Level Agreements (SLAs).

Forge and maintain pivotal event lifecycle management processes, harnessing the power of observability tools such as Datadog, New Relic, and AppDynamics.

Serve as a trusted advisor to leadership, contributing insights and influence across multifaceted technical domains, including applications, networks, information security, databases, operating systems, and web technologies, addressing complex business needs across various departments.

Tackle challenges with a strategic mindset, collaborating with professionals and managers at all organisational levels.

Safeguard business continuity by minimising the impact of failures and enabling rapid recovery.

Facilitate troubleshooting by introducing essential observability and tracing tools, and streamline the onboarding process for new customers, including network connectivity, VPC peering, site-to-site VPNs, and more.

Spearhead the advancement of cloud maturity within our products, validating against well-architected frameworks to drive cost optimization, enhance performance efficiency, and fortify security measures.

Deliver data-driven solutions that are not only cost-effective and scalable but also reflect long-term vision, creativity, innovation, and advanced analytical thinking.

Research, design, and implement solutions for fault tolerance, performance optimization, capacity management, and configuration control across diverse cloud operations.

Cultivate a repository of reusable components, templates, and intricate code segments to expedite project delivery.

Maintain an unwavering commitment to personal growth and development, encompassing both technical and product-oriented domains.

Establish ownership and accountability as cornerstones of your approach, both in terms of project timelines and the steadfast maintenance of system uptime and production SLAs.

Implement data-driven practices by curating and constructing comprehensive metrics encompassing system performance, infrastructure, platform health, and business impact.

Actively engage in mentoring and guiding fellow team members, fostering a culture of continuous learning and improvement.

Participate diligently in on-call duties and contribute expertise in the resolution of major issues to ensure the resilience of our systems.



Key Requirements

    • 8+ years of experience in building scalable, highly critical distributed systems.
    • B.Tech in Computer Science or equivalent from a reputed college.
    • Excellent programming skills in Python, Go, Ruby, or any other popular language. Shell scripting proficiency is expected.
    • A habit of encouraging and building automated processes wherever possible.
    • Experience in working on highly interdependent and complex multi-service architecture.
    • Strong in networking (triaging, packet loss, routing, protocols, TCP/IP stack), operating systems, and Docker / containerization.
    • Knowledge of the fundamental principles of distributed systems (architectures, microservices, high availability, elections).
    • Thorough understanding of the cloud service delivery infrastructure ecosystem, operational processes, and orchestration models.
    • Excellent skills in investigating and troubleshooting complicated systems/platforms and identifying key points of failure.
    • Familiarity with monitoring and logging best practices.
    • Experience with configuration management and infrastructure provisioning tools such as Ansible and Terraform.