Infrastructure Resilience QA Engineer
Vilnius
Engineering – OxySERPS / Full Time / Hybrid
We’re a team of 500+ professionals who develop cutting-edge proxy and web data scraping solutions for thousands of the world’s best-known businesses, including Fortune 500 companies.
What’s in store for you:
You’ll be solving complex challenges and maintaining our own infrastructure. Here is its scale and maturity in numbers:
- 6PB+ Ceph storage
- 60PB+ monthly data traffic through our systems
- 300k+ service requests/sec processed
- 500k+ Kafka messages/sec streamed
A word from the team:
Join us as a Chaos Engineer (Resilience QA Engineer) and become the guardian of reliability in our distributed system. You’ll design chaos experiments, uncover hidden weaknesses, and make our platform stronger against real-world failures. This is a hands-on role where your work directly impacts system uptime, customer trust, and engineering velocity. If you’re passionate about resilience, systems thinking, and pushing software beyond its limits, this is your chance to make a real difference.
Your day-to-day:
- Design and execute fault injection experiments (service crashes, latency, network partitions, resource exhaustion).
- Conduct load, stress, and soak testing of microservices and system components.
- Validate recovery strategies (circuit breakers, retries, failovers).
- Verify observability and monitoring coverage, highlighting blind spots.
- Automate resilience test suites and integrate them into CI/CD.
- Maintain resilience benchmarks (latency/error budgets).
- Collaborate with engineers, SREs, and QA to prioritize improvements.
- Provide clear reports with reproduction steps and impact assessments.
Check out our tech stack here: stackshare.io/oxylabs/oxylabs
Your skills & experience:
- Programming/scripting skills for automation (Python, Go, Bash).
- Automated testing framework knowledge.
- Experience with chaos testing tools (Chaos Mesh, Gremlin, Litmus, etc.).
- Strong understanding of Kubernetes/Docker and microservice architectures.
- Familiarity with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
- Knowledge of resilience design patterns (circuit breakers, retries, failover).
- Exposure to private cloud environments.
- Previous experience in high-availability environments.
Nice to have:
Salary:
- Gross salary: 3300–6500 EUR/month. Keep in mind that we are open to discussing a different salary based on your skills and experience.
Up for the challenge? Let’s talk!