Senior Site Reliability Engineer

Job in Stamford, Fairfield County, Connecticut, 06925, USA
Listing for: Castleton Commodities International, LLC
Full Time position
Listed on 2026-05-31
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 120,000 - 160,000 USD yearly
Job Description & How to Apply Below

Responsibilities

  • Reliability Engineering & Operations - Own and improve service reliability through SLO/SLI definition, error budgets, and operational best practices.
  • Design, implement, and maintain observability (monitoring, logging, tracing, alerting) to reduce MTTR and improve proactive detection.
  • Lead incident response practices including on-call improvements, runbooks, post-incident reviews (RCA), and preventative actions.
  • Partner with application teams to improve performance, capacity planning, and resiliency under failure scenarios.
  • Infrastructure & Cloud Architecture - Design and operate highly available, fault-tolerant cloud architectures (multi-AZ and, where required, multi-region); implement resilient patterns across compute, storage, networking, and managed services (autoscaling, load balancing, backups, replication).
  • Drive cloud governance best practices (tagging, account/landing zone patterns, least privilege, guardrails) in partnership with security and platform teams.
  • Infrastructure as Code (IaC) & DevOps Enablement - Build and maintain IaC modules and standards (Terraform, CloudFormation, CDK); develop, standardize, and optimize CI/CD pipelines to enable safe, automated deployments (GitHub Actions, GitLab CI, Jenkins, AWS CodePipeline); promote DevOps practices such as version-controlled infrastructure, automated testing, immutable deployments, and progressive delivery patterns; establish environment consistency across dev/test/stage/prod and ensure infrastructure drift detection and remediation.
  • BCP/DR, RTO/RPO Definition & Testing - Collaborate with stakeholders to evaluate and define service-level RTO and RPO targets; design and implement BCP/DR architectures and procedures; coordinate structured DR tests (tabletop, simulation, partial failover, full failover) and document outcomes; maintain DR runbooks, dependency maps, and recovery checklists; produce metrics and reporting on DR readiness, test results, and continuous improvement actions.
  • Ability to work effectively in a fast-paced, dynamic and high-intensity environment, with timely responsiveness and the ability to work beyond normal business hours when required.
Qualifications
  • 7+ years of experience in SRE, DevOps, Platform Engineering, or Systems Engineering roles supporting production environments.
  • Strong proficiency with observability platforms (Datadog, Prometheus/Grafana, ELK/OpenSearch, Nagios, Nimsoft, etc.).
  • Strong hands‑on AWS experience building and operating production systems.
  • Proven expertise with Infrastructure as Code (Terraform and/or CloudFormation/CDK).
  • Strong CI/CD and automation background (pipeline design, deployment strategies, testing automation).
  • Experience defining and validating RTO/RPO, and implementing BCP/DR plans with structured testing.
  • Experience with Kubernetes and auto-scaling container platforms (EKS, ECS, or Kubernetes on-prem).
  • Strong Linux fundamentals, networking concepts (DNS, TCP/IP, load balancing), and troubleshooting skills.
  • Proficiency in at least one scripting/programming language (Python, Go, Bash, or similar).
  • Ability to write clear operational documentation, runbooks, and post-incident reports.
Preferred Qualifications
  • Familiarity with Azure and/or Oracle Cloud (OCI).
  • Familiarity with Service Mesh, API Gateways, and distributed tracing tooling.
  • Familiarity with OpenTelemetry, client instrumentation, and collector configuration.
  • Security and compliance familiarity in cloud environments (IAM design, secrets management, audit logging).
  • Experience implementing progressive delivery (blue/green, canary), feature flags, and automated rollback.
  • Relevant certifications (AWS Solutions Architect/DevOps Engineer, Kubernetes CKA/CKAD).
  • Experience with ArgoCD & Karpenter.
Employee Programs & Benefits
  • Competitive comprehensive medical, dental, retirement and life insurance benefits.
  • Employee assistance & wellness programs.
  • Parental and family leave policies.
  • CCI Community volunteering program - 2 days annually to volunteer at selected charities.
  • Charitable contribution match program.
  • Tuition assistance & reimbursement.
  • Quarterly Innovation & Collaboration Awards.
  • Employee discount program, including access to fitness facilities.
  • Competitive paid time off.
  • Continued learning opportunities.
Position Requirements
10+ Years work experience