Data Engineer - Senior Data Engineer

Bangalore, Karnataka

Technology – Risk – Tech

Full-time Employment

On-site

About Us:

Paytm is India’s largest digital payments and financial services platform, leading the mobile QR revolution. We power millions of businesses and individuals, and we’re building scalable, resilient systems to serve half a billion Indians and beyond.

Here at Paytm, technology isn't just about keeping up; it's about building the future. We believe the next generation of engineers must not only scale systems to billions but also leverage AI and GPT tools to accelerate innovation.

About the Role:

We're looking for a Senior AI Engineer with 3-6 years of experience who can build production-grade agentic AI systems. You'll be working at the intersection of cutting-edge AI research and scalable engineering, creating autonomous agents that can reason, plan, and execute complex tasks reliably at scale.

What We're Looking For:

Agentic AI & LLM Engineering

You should have hands-on experience with:

1) Multi-agent systems: Building agents that coordinate, communicate, and work together on complex workflows.

2) Agent orchestration: Designing systems where AI agents can plan multi-step tasks, use tools, and make autonomous decisions.

3) LLMOps experience, in two areas. End-to-end LLM lifecycle management: hands-on experience managing the complete LLM workflow, from prompt engineering and dataset curation through model fine-tuning, evaluation, and deployment. This includes versioning prompts, managing training datasets, orchestrating distributed training jobs, and implementing automated model validation pipelines. Production LLM infrastructure: experience building and maintaining production LLM serving infrastructure, including model registries, A/B testing frameworks for comparing model versions, automated rollback mechanisms, and monitoring systems that track model performance, latency, and cost metrics in real time.

4) AI Observability: Experience implementing comprehensive monitoring and tracing for AI systems, including prompt tracking, model output analysis, cost monitoring, and agent decision-making visibility across complex workflows.

5) Evaluation frameworks: Creating comprehensive testing for agent performance, safety, and goal achievement.

6) LLM inference optimization: Scaling model serving with techniques like batching and caching, and with efficient serving frameworks such as vLLM and TensorRT-LLM.

Systems Engineering

Strong backend development skills including:

1) Python expertise: FastAPI, Django, or Flask for building robust APIs that handle agent workflows

2) Distributed systems: Microservices, event-driven architectures, and message queues (Kafka, RabbitMQ) for agent coordination

3) Database strategy: Vector databases, traditional SQL/NoSQL, and caching layers optimized for agent state management

4) Web-scale design: Systems handling millions of requests with proper load balancing and fault tolerance

DevOps (Non-negotiable)

1) Kubernetes: Working knowledge required - deployments, services, cluster management

2) Containerization: Docker with production optimization and security best practices

3) CI/CD: Automated testing and deployment pipelines

4) Infrastructure as Code: Terraform, Helm charts

5) Monitoring: Prometheus, Grafana for tracking complex agent behaviors

Programming Languages: Java, Python

What You'll Build:

You'll architect the infrastructure that powers our autonomous AI systems:

Agent Orchestration Platform: Multi-agent coordination systems that handle complex, long-running workflows with proper state management and failure recovery.

Evaluation Infrastructure: Comprehensive frameworks that assess agent performance across goal achievement, efficiency, safety, and decision-making quality.

Production AI Services: High-throughput systems serving millions of users with intelligent resource management and robust fallback mechanisms.

Training Systems: Scalable pipelines for SFT and DPO that continuously improve agent capabilities based on real-world performance and human feedback.

Ideal Profile:

1) You've spent serious time in production environments building AI systems that actually work. You understand the unique challenges of agentic AI: managing state across long conversations, handling partial failures in multi-step processes, and ensuring agents stay aligned with their intended goals.

2) You've dealt with the reality that the hardest problems aren't always algorithmic. Sometimes it's about making an agent retry gracefully when an API call fails, or designing an observability layer that catches when an agent starts behaving unexpectedly, or building systems that can scale from handling dozens of agent interactions to millions.

3) You're excited about the potential of AI agents but pragmatic about the engineering work required to make them reliable in production.

Preferred Qualifications: Bachelor's or Master's degree in Computer Science or an equivalent field.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
