Location: Bengaluru
Job Summary
The role is accountable for tactical and operational support of production services across one or more specific platform or domain areas.
Its aim is to ensure maximum service quality and stability through fast and effective response to technical incidents, and to act as a catalyst for change by analysing and identifying continual service improvement opportunities. Depending on the area of technical specialisation, in addition to incident resolution and prevention, the role may also act in a control capacity to ensure that new changes to the technology estate do not introduce instability.
Manage technical resumption for high-priority, S@R, and medium/high-severity incidents; provide end-to-end support and implement resolutions within SLA.
Provide root cause analysis for S@R and medium/high-severity issues, and ensure all follow-up action points are carried out.
Responsible for the stability of the production system. Direct second- and third-level support for problem diagnosis and resolution as per the agreed SLAs.
Responsible for managing production-related changes, releases, and rollouts with zero or minimal impact on the stability of the application. Review dependent changes to surrounding systems, infrastructure, networking, etc. Ensure proper technical plans are in place for all production changes (e.g. fallback plan, implementation plan, data conversion).
Create and update Production Support documentation, contingency (DR/BCP) documentation and processes.
Provide inputs to the PSS manager for the monthly dashboard, which reports incident and problem trends along with SIP and RCA action items.
Participate in and support cross-training and knowledge transfer activities within support teams.
Key Responsibilities
Service stability and incident management
Ensure maximum service quality and stability through prompt and effective response to technical incidents.
Act as a catalyst for change by performing incident and problem analysis, identifying root causes, and driving continual service improvement (CSI) initiatives.
Where relevant, perform a control function to ensure that new technology changes do not introduce instability into the production environment.
Monitoring and observability
Drive and achieve 'north star' monitoring and observability goals.
Ensure comprehensive monitoring, alerting, and logging are in place for critical services, enabling proactive detection and rapid remediation of issues.
Automation and operational excellence
Automate operational tasks such as deployments, monitoring, scaling, and infrastructure management to reduce manual effort and operational risk.
Site Reliability Engineering (SRE) practices
Troubleshoot issues and participate in incident response and post-incident reviews (post-mortems) to minimise downtime and institutionalise learning from failures.
Optimise infrastructure, systems, and processes for performance, efficiency, and reliability.
Contribute to the design and implementation of robust deployment pipelines and release strategies that enable smooth, frequent, and reliable releases (e.g. blue/green, canary).
Change, release, and rollout management
Review and implement production-related changes, releases, and rollouts with zero or minimal impact to application stability and client experience.
Review and coordinate dependent changes across surrounding systems, infrastructure, networks, and shared services.
Ensure thorough technical plans are in place for all production changes, including implementation steps, fallback/rollback strategies, data conversion or migration plans, and validation checks.
Reporting and continuous improvement
Drive closure of remediation actions to prevent recurrence of incidents.
Collaboration, coaching, and knowledge sharing
Participate in and support cross-training and structured knowledge transfer activities within and across support and engineering teams.
Leverage AI and automation for production engineering
Use AI-driven tools (e.g. for log analysis, anomaly detection, alert correlation, and capacity forecasting) to proactively identify, diagnose, and resolve production…