Data Engineer - Senior Data Engineer

Bangalore, Karnataka
Technology – Risk – Tech / Full-time Employment / On-site

The Role

We're looking for a senior AI engineer who can build production-grade agentic AI systems. You'll be working at the intersection of cutting-edge AI research and scalable engineering, creating autonomous agents that can reason, plan, and execute complex tasks reliably at scale.

What We Need 

Agentic AI & LLM Engineering
You should have hands-on experience with:
Multi-agent systems: Building agents that coordinate, communicate, and work together on complex workflows
Agent orchestration: Designing systems where AI agents can plan multi-step tasks, use tools, and make autonomous decisions
LLMOps - end-to-end LLM lifecycle management: Hands-on experience managing the complete LLM workflow, from prompt engineering and dataset curation through model fine-tuning, evaluation, and deployment. This includes versioning prompts, managing training datasets, orchestrating distributed training jobs, and implementing automated model validation pipelines.
LLMOps - production LLM infrastructure: Experience building and maintaining production LLM serving infrastructure, including model registries, A/B testing frameworks for comparing model versions, automated rollback mechanisms, and monitoring systems that track model performance, latency, and cost metrics in real time.
AI Observability: Experience implementing comprehensive monitoring and tracing for AI systems, including prompt tracking, model output analysis, cost monitoring, and agent decision-making visibility across complex workflows.
Evaluation frameworks: Creating comprehensive testing for agent performance, safety, and goal achievement
LLM inference optimization: Scaling model serving with techniques like batching, caching, and efficient frameworks (vLLM, TensorRT-LLM)

Systems Engineering
Strong backend development skills including:
Python expertise: FastAPI, Django, or Flask for building robust APIs that handle agent workflows
Distributed systems: Microservices, event-driven architectures, and message queues (Kafka, RabbitMQ) for agent coordination
Database strategy: Vector databases, traditional SQL/NoSQL, and caching layers optimized for agent state management
Web-scale design: Systems handling millions of requests with proper load balancing and fault tolerance

DevOps (Non-negotiable)
Kubernetes: Working knowledge required - deployments, services, cluster management
Containerization: Docker with production optimization and security best practices
CI/CD: Automated testing and deployment pipelines
Infrastructure as Code: Terraform, Helm charts
Monitoring: Prometheus, Grafana for tracking complex agent behaviors
Programming languages: Java, Python

What You'll Build
You'll architect the infrastructure that powers our autonomous AI systems:
Agent Orchestration Platform: Multi-agent coordination systems that handle complex, long-running workflows with proper state management and failure recovery.
Evaluation Infrastructure: Comprehensive frameworks that assess agent performance across goal achievement, efficiency, safety, and decision-making quality.
Production AI Services: High-throughput systems serving millions of users with intelligent resource management and robust fallback mechanisms.
Training Systems: Scalable pipelines for supervised fine-tuning (SFT) and direct preference optimization (DPO) that continuously improve agent capabilities based on real-world performance and human feedback.

Who You Are
You've spent serious time in production environments building AI systems that actually work. You understand the unique challenges of agentic AI - managing state across long conversations, handling partial failures in multi-step processes, and ensuring agents stay aligned with their intended goals.
You've dealt with the reality that the hardest problems aren't always algorithmic. Sometimes it's about making an agent retry gracefully when an API call fails, or designing an observability layer that catches when an agent starts behaving unexpectedly, or building systems that can scale from handling dozens of agent interactions to millions.
You're excited about the potential of AI agents but pragmatic about the engineering work required to make them reliable in production.