AI Platform Site Reliability Engineering Specialist
Date: Feb 2, 2026
Location: Bengaluru, KA, IN
Company: NTT DATA Services
Req ID: 354116
NTT DATA strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.
We are currently seeking an AI Platform Site Reliability Engineering Specialist to join our team in Bengaluru, Karnātaka (IN-KA), India (IN).
What you'll do in the role:
Below is a sample of potential responsibilities depending on product/focus area:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling and resource forecasting
- Optimize cost vs. performance trade-offs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
- Maintain runbooks, operation playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
What you'll bring to the role:
- Bachelor's or Master's degree in Computer Science or related field, or equivalent job experience
- 5 years of production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
- Experience with infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation
Nice to have:
- Understanding of SRE techniques
- Proficiency with OpenTelemetry and observability tools including Grafana, Loki, Prometheus, and Cortex
- Good knowledge of microservice-based architecture and industry standards for both public and private cloud
- Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
- Good knowledge of various DB engines (SQL, Redis, Kafka, Snowflake, etc.) for cloud app storage
- Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models
- Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
- Understanding of ModelOps / MLOps / LLMOps
- Experience with chaos engineering, canary deployments, blue/green rollouts
About NTT DATA
NTT DATA is a $30 billion business and technology services leader, serving 75% of the Fortune Global 100. We are committed to accelerating client success and positively impacting society through responsible innovation. We are one of the world's leading AI and digital infrastructure providers, with unmatched capabilities in enterprise-scale AI, cloud, security, connectivity, data centers and application services. Our consulting and industry solutions help organizations and society move confidently and sustainably into the digital future. As a Global Top Employer, we have experts in more than 50 countries. We also offer clients access to a robust ecosystem of innovation centers as well as established and start-up partners. NTT DATA is a part of NTT Group, which invests over $3 billion each year in R&D.
Whenever possible, we hire locally to NTT DATA offices or client sites. This ensures we can provide timely and effective support tailored to each client’s needs. While many positions offer remote or hybrid work options, these arrangements are subject to change based on client requirements. For employees near an NTT DATA office or client site, in-office attendance may be required for meetings or events, depending on business needs. At NTT DATA, we are committed to staying flexible and meeting the evolving needs of both our clients and employees. NTT DATA recruiters will never ask for payment or banking information and will only use @nttdata.com and @talent.nttdataservices.com email addresses. If you are asked to provide payment or disclose banking information, please submit a contact form at https://us.nttdata.com/en/contact-us.
NTT DATA endeavors to make https://us.nttdata.com accessible to any and all users. If you would like to contact us regarding the accessibility of our website or need assistance completing the application process, please contact us at https://us.nttdata.com/en/contact-us. This contact information is for accommodation requests only and cannot be used to inquire about the status of applications. NTT DATA is an equal opportunity employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability or protected veteran status.