Senior Site Reliability Engineer – Distributed Systems
Listed on 2026-01-02
IT/Tech
Cloud Computing, Systems Engineer
About the role
As a Site Reliability Engineer, you will make an impact by designing and implementing observability solutions tailored for distributed edge computing environments. You will be a valued member of the Technology & Engineering team and work collaboratively with cross-functional teams to ensure system reliability, performance, and visibility across remote facilities.
In this role, you will
- Design and implement observability frameworks for edge computing environments, including monitoring, logging, tracing, and metrics collection.
- Define and maintain SLIs, SLOs, and business KPIs to measure and enhance system reliability across edge and centralized infrastructure.
- Build dashboards, visualizations, and alerting systems for real-time insights and incident response.
- Implement distributed tracing and log aggregation systems to troubleshoot complex edge issues.
- Collaborate with engineering teams to embed observability best practices into edge applications and infrastructure.
- Proactively identify issues using advanced observability tools, reducing MTTD and MTTR.
- Lead incident postmortems and implement observability-driven improvements.
- Develop automation scripts and tools to optimize observability pipelines for bandwidth-constrained environments.
- Optimize data storage and querying strategies for performance, cost, and scalability.
- Stay current with emerging observability trends and advocate for adoption of edge-specific solutions.
At Cognizant, we strive to provide flexibility wherever possible, and we are here to support a healthy work-life balance through our various wellbeing programs. Based on this role’s business requirements, this is an onsite position requiring 5 days a week in a client or Cognizant office.
Please note:
This role will require an in-person meet and greet at our Cognizant office or client location.
The working arrangements for this role are accurate as of the date of posting. This may change based on the project you’re engaged in, as well as business and client requirements. Rest assured, we will always be clear about role expectations.
What you need to have to be considered
- 10+ years of IT experience.
- 3–5 years of experience in service reliability/operations for large-scale hybrid environments.
- 3–5 years of experience writing automation scripts and building dashboards for application performance management.
- 2–4 years of experience with programming languages such as Go, Python, Java, or Rust.
- Working knowledge of databases such as Oracle, SQL Server, Redis, ClickHouse, PostgreSQL, MongoDB, or time-series databases.
- At least 2 years of experience with cloud platforms and containerization (GCP, AWS, Rancher, Azure, OpenShift).
- Experience maintaining containerized apps in GKE/RKE/AKE environments.
- Experience implementing cloud observability using OpenTelemetry (OTEL).
- Experience with GraphQL frameworks (Apollo, Prisma, Hasura).
- Strong understanding of networking protocols (TCP/IP, HTTP, DNS, load balancing, service mesh).
- Proven experience managing application availability and building automation for high-availability platforms.
- Hands-on experience with monitoring tools like Splunk, AppDynamics, Grafana/Prometheus, and Dynatrace.
- Experience with CI/CD tools and extenders such as Rally and Confluence.
- Experience with in-memory caching solutions (Redis preferred).
- Strong debugging skills across integrated technical platforms and API gateways.
- Hands-on experience with GCS, Cloud SQL, Spanner, and Firestore.
- Experience in enterprise-level infrastructure and operations.
- Expertise in high-availability and distributed systems, Linux/Windows administration, and support.
- Experience monitoring and troubleshooting HashiCorp Vault environments.
- Working knowledge of Vertex AI, Gen AI, and BigQuery.
- Bachelor’s degree in computer science, IT, or equivalent.
Salary and Other Compensation
The annual salary for this position depends on the experience and other qualifications of the successful candidate.
This position is also eligible for Cognizant’s discretionary annual incentive program, based on performance and subject to the terms of Cognizant’s applicable plans.
Benefits:
Cognizant offers the following benefits for this position, subject to applicable eligibility requirements:
- Medical/Dental/Vision/Life Insurance
- Paid holidays plus Paid Time Off
- 401(k) plan and contributions
- Long-term/Short-term Disability
- Paid Parental Leave
- Employee Stock Purchase Plan
Disclaimer:
The salary, other compensation, and benefits information is accurate as of the date of this posting. Cognizant reserves the right to modify this information at any time, subject to applicable law.