Lead Site Reliability Engineer T500-23189 Job, Hyderabad area, Telangana, India, IT/Tech

Position: Lead Site Reliability Engineer [T500-23189]
About Inspire Brands:
Inspire Brands is disrupting the restaurant industry through digital transformation and operational efficiencies. The company’s technology hub, Inspire Brands Hyderabad Support Center, India, will lead technology innovation and product development for the organization and its portfolio of distinct brands. The Inspire Brands Hyderabad Support Center will focus on developing new capabilities in data science, data analytics, eCommerce, automation, cloud computing, and information security to accelerate the company’s business strategy.

Inspire Brands Hyderabad Support Center will also host an innovation lab and collaborate with start-ups to develop solutions for productivity optimization, workforce management, loyalty management, payments systems, and more.

POSITION SUMMARY:

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems enabling online ordering for thousands of restaurants across multiple brands. SRE ensures that Inspire Digital Platform (IDP) services have reliability and uptime appropriate to users' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on the capacity and performance of our systems.

SRE is also responsible for performing regular capacity planning exercises. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating toil through automation.

ESSENTIAL JOB RESPONSIBILITIES:

Responsibility
Technical
Mentoring / Technical Escalations
Education

Knowledge and Skills (General and Technical):
Review current workload patterns, understand the business case and prioritize areas of weakness within the platform through log and metric investigation as well as application profiling.
Work with senior engineering and testing team members to build tools and recommend testing strategies for problem prevention and detection.
Employ deep troubleshooting skills to improve availability, performance, and security, ensuring services are designed for 24/7 availability, operational readiness, and rigor.
Perform in-depth postmortems on production incidents to assess business impact and enable Engineering to learn from them.
Create dashboards and alerts for monitoring the IDP platform, define key metrics and service level indicators, and ensure relevant metric data is collected to create actionable alerts for SRE and the Network Operations Center.
Participate in the 24/7 on-call rotation.
Automate toil by building software and automation for seamless application deployment and third-party tool integration.
Ensure the platform holds a high degree of reliability, at least three nines (99.9% availability).
Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
Own technically intricate issues that cut across DevOps, databases, networking, code, infrastructure, and people; drive them to satisfactory completion.
Provide recommendations and feedback in design reviews and review sessions.
Mentor and guide junior members of the team.
Identify gaps and create a curated technology learning path for team members.
Troubleshoot and triage technical roadblocks for scheduled deliverables.

KNOWLEDGE, SKILLS AND ABILITIES:

4-year degree in Computer Science, Information Technology, or a related field.
Minimum 10 years of experience as a Software Engineer, Platform, SRE, or DevOps engineer supporting large-scale SaaS production B2C or B2B cloud platforms.
Hands-on problem-solving and troubleshooting.
Development skills in Java, TypeScript, and Python are a must, along with OOP expertise.
Hands-on Azure Cloud experience, particularly with AKS, API Management, Azure Cache for Redis, Azure Blob Storage, Cosmos DB, Service Bus, and Azure Functions.
Proficiency in monitoring, APM, and profiling tools such as New Relic, Splunk, Prometheus, and Grafana.

Working experience with containers, Kubernetes and Helm.
Functional knowledge of Cloud Network,…

