Manager, Site Reliability Engineering Job Noida area, Uttar Pradesh India, IT/Tech

Are you our 'TYPE'?

Monotype Global

Named 'One of the Most Innovative Companies in Design' by Fast Company, Monotype brings brands to life through type and technology that consumers engage with every day.

The company's rich legacy includes a library that can be traced back hundreds of years, featuring famed typefaces like Helvetica, Futura, Times New Roman and more.

Monotype also provides a first-of-its-kind service that makes fonts more accessible for creative professionals to discover, license, and use in our increasingly digital world. We work with the biggest global brands, and with individual creatives, offering a wide set of solutions that make it easier for them to do what they do best: design beautiful brand experiences.

Monotype Solutions India

Monotype Solutions India is a strategic center of excellence for Monotype and has been certified as a Great Place to Work® three years in a row. The focus of this fast-growing center spans Product Development, Product Management, Experience Design, User Research, Market Intelligence, Research in Artificial Intelligence and Machine Learning, Innovation, Customer Success, Enterprise Business Solutions, and Sales.

Headquartered in the Boston area of the United States, with offices across four continents, Monotype is the world's leading company in fonts. It is a trusted partner to the world's top brands and was named 'One of the Most Innovative Companies in Design' by Fast Company.

We are looking for an experienced and hands-on Site Reliability Engineering (SRE) Manager to lead the reliability, stability, and operational excellence of our enterprise platforms. This role will own both 24x7 incident management operations and SRE engineering efforts, ensuring high system availability, fast incident response, and continuous improvement of platform reliability.

You will lead a team responsible for maintaining uptime, reducing incidents, improving response times, and building a more proactive and self-sufficient SRE function. The role requires a balance of hands-on technical depth and people leadership, with a strong focus on automation, observability, release stability, and team maturity.

As we expand into AI-driven workloads, you will also support reliability, monitoring, and scalability of these systems.

What You'll Be Doing

Reliability & Incident Management

Own end-to-end reliability of production systems, ensuring uptime within defined SLAs
Lead and govern a 24x7x365 incident management team, ensuring quick response and resolution
Act as escalation point during critical incidents and drive coordination across teams
Ensure proper incident tracking, communication, and status page updates

Incident Improvement & RCA

Drive a strong blameless RCA culture across the team
Ensure all customer-impacting incidents are analysed with clear root causes
Track and drive closure of RCA action items to prevent repeat issues
Identify recurring patterns and push for permanent fixes

Observability & Monitoring

Own and improve observability using tools like Datadog, CloudWatch, ELK, and Prometheus
Guide teams on effective logging, alerting, and monitoring practices
Reduce alert noise and improve signal-to-noise ratio
Drive proactive monitoring and early detection of issues

Automation & Operational Efficiency

Drive automation to reduce manual effort and operational toil
Identify repetitive issues and build solutions to eliminate them
Ensure runbooks and playbooks are created and followed for recurring incidents

Release Stability & Production Readiness

Work with Product, Engineering, and Platform teams to improve release quality and stability
Ensure proper readiness checks before production deployments (monitoring, rollback, alerts)
Reduce production issues caused by releases

AI Workload Reliability

Support reliability and monitoring of AI/ML workloads in production and experimentation environments
Ensure visibility, stability, and cost awareness for AI-driven systems
Bring structure and best practices as AI adoption grows

Team Leadership & Development

Lead and mentor a team of 14 engineers across operations and SRE excellence
Build team maturity and reduce dependency on senior members
Develop strong ownership and accountability within the team

Cross-team Collaboration

Work closely with Engineering, Product and Platform teams
Ensure smooth coordination during incidents and releases
Communicate effectively with stakeholders during high-severity situations
Collaborate with stakeholders to align reliability and platform strategies with business goals

Cost & Efficiency

Partner with teams to optimize cloud…
