Senior Site Reliability Engineer
Job in Stamford, Fairfield County, Connecticut, 06925, USA
Listed on 2026-05-31
Listing for: Castleton Commodities International, LLC
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Responsibilities
- Reliability Engineering & Operations - Own and improve service reliability through SLO/SLI definition, error budgets, and operational best practices.
- Design, implement, and maintain observability (monitoring, logging, tracing, alerting) to reduce MTTR and improve proactive detection.
- Lead incident response practices including on-call improvements, runbooks, post-incident reviews (RCA), and preventative actions.
- Partner with application teams to improve performance, capacity planning, and resiliency under failure scenarios.
- Infrastructure & Cloud Architecture - Design and operate highly available, fault-tolerant cloud architectures (multi-AZ and, where required, multi-region); implement resilient patterns across compute, storage, networking, and managed services (autoscaling, load balancing, backups, replication).
- Drive cloud governance best practices (tagging, account/landing zone patterns, least privilege, guardrails) in partnership with security and platform teams.
- Infrastructure as Code (IaC) & DevOps Enablement - Build and maintain IaC modules and standards (Terraform, CloudFormation, CDK); develop, standardize, and optimize CI/CD pipelines to enable safe, automated deployments (GitHub Actions, GitLab CI, Jenkins, AWS CodePipeline); promote DevOps practices such as version-controlled infrastructure, automated testing, immutable deployments, and progressive delivery patterns; establish environment consistency across dev/test/stage/prod and ensure infrastructure drift detection and remediation.
- BCP/DR, RTO/RPO Definition & Testing - Collaborate with stakeholders to evaluate and define service-level RTO and RPO targets; design and implement BCP/DR architectures and procedures; coordinate structured DR tests (tabletop, simulation, partial failover, full failover) and document outcomes; maintain DR runbooks, dependency maps, and recovery checklists; produce metrics and reporting on DR readiness, test results, and continuous improvement actions.
Qualifications
- Ability to work effectively in a fast-paced, dynamic and high-intensity environment, with timely responsiveness and the ability to work beyond normal business hours when required.
- 7+ years of experience in SRE, DevOps, Platform Engineering, or Systems Engineering roles supporting production environments.
- Strong proficiency with observability platforms (Datadog, Prometheus/Grafana, ELK/OpenSearch, Nagios, Nimsoft, etc.).
- Strong hands‑on AWS experience building and operating production systems.
- Proven expertise with Infrastructure as Code (Terraform and/or CloudFormation/CDK).
- Strong CI/CD and automation background (pipeline design, deployment strategies, testing automation).
- Experience defining and validating RTO/RPO, and implementing BCP/DR plans with structured testing.
- Experience with Kubernetes and auto-scaling container platforms (EKS, ECS, or Kubernetes on-prem).
- Strong Linux fundamentals, networking concepts (DNS, TCP/IP, load balancing), and troubleshooting skills.
- Proficiency in at least one scripting/programming language (Python, Go, Bash, or similar).
- Ability to write clear operational documentation, runbooks, and post-incident reports.
- Familiarity with Azure and/or Oracle Cloud (OCI).
- Familiarity with Service Mesh, API Gateways, and distributed tracing tooling.
- Familiarity with OpenTelemetry, client instrumentation, and collector configuration.
- Security and compliance familiarity in cloud environments (IAM design, secrets management, audit logging).
- Experience implementing progressive delivery (blue/green, canary), feature flags, and automated rollback.
- Relevant certifications (AWS Solutions Architect/DevOps Engineer, Kubernetes CKA/CKAD).
- Experience with ArgoCD & Karpenter.
Benefits
- Competitive comprehensive medical, dental, retirement and life insurance benefits.
- Employee assistance & wellness programs.
- Parental and family leave policies.
- CCI Community volunteering program - 2 days annually to volunteer at selected charities.
- Charitable contribution match program.
- Tuition assistance & reimbursement.
- Quarterly Innovation & Collaboration Awards.
- Employee discount program, including access to fitness facilities.
- Competitive paid time off.
- Continued learning opportunities.
Position Requirements
10+ years of work experience