Infrastructure SRE Architect & Engineering Lead

Job in Bloomfield, Essex County, New Jersey, 07003, USA
Listing for: Apolis
Full Time position
Listed on 2026-06-05
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability
Role:
Infrastructure SRE Architect & Engineering Lead

Location: USA Remote

Job Description:
The Infrastructure SRE Architect & Engineering Lead is responsible for defining and driving the enterprise-scale reliability, observability, and automation strategy across infrastructure services. This role operates at the intersection of architecture, engineering, and operations—establishing standards, guiding engineering practices, and ensuring that reliability engineering principles are embedded into day-to-day service delivery.
As a senior technical leader, this role challenges traditional operations models by introducing measurable reliability frameworks, advanced observability patterns, and automation-driven operations. The position requires leading cross-functional transformation initiatives, influencing platform and infrastructure teams, and continuously improving service resilience, performance, and efficiency through data-driven insights and engineering discipline.

Responsibilities
  • Define and govern enterprise observability and reliability engineering standards, including SLO frameworks, service health models, and instrumentation strategies
  • Lead the design and evolution of observability architectures, including dashboards, alerting strategies, and telemetry integration patterns
  • Establish and drive reliability practices such as SLO management, error budget governance, and proactive risk identification
  • Oversee the development and scaling of automation capabilities, including self-healing workflows, validation pipelines, and configuration compliance controls
  • Provide technical leadership for reliability analysis, including identification of systemic risks, failure patterns, and resilience gaps across infrastructure domains
  • Drive continuous improvement through structured analytics, including performance trends, capacity insights, and cost optimization opportunities
  • Partner with platform engineering and client stakeholders to evaluate and implement new observability and automation capabilities
  • Lead post-incident review frameworks focused on detection effectiveness, diagnostic quality, and prevention strategies
  • Maintain and prioritize a strategic backlog of reliability and automation initiatives aligned to business objectives
  • Mentor engineering teams and promote adoption of SRE principles, modern operational practices, and engineering-driven service delivery
Required Skills
  • Strong expertise in Site Reliability Engineering (SRE) principles, including SLO/SLI design, error budget management, and reliability modeling
  • Deep knowledge of observability platforms (e.g., Datadog, Dynatrace, Prometheus, Grafana, Splunk) and telemetry design (metrics, logs, traces)
  • Advanced experience designing automation solutions using tools such as Ansible, Terraform, or cloud-native orchestration frameworks
  • Experience building and operationalizing monitoring, alerting, and incident response frameworks at scale
  • Strong understanding of infrastructure platforms (cloud, compute, storage, network) and their reliability characteristics
  • Demonstrated ability to perform system-level analysis, including trend analysis, capacity modeling, and failure pattern identification
  • Experience leading large-scale engineering or transformation initiatives across distributed teams
  • Strong stakeholder management and communication skills with the ability to influence senior technical and business leaders
Desired Skills
  • Experience implementing SRE practices within managed services or enterprise IT operating models
  • Familiarity with AIOps, event correlation, and predictive analytics platforms
  • Experience with CI/CD pipelines and integrating observability and automation into software delivery life cycles
  • Knowledge of FinOps practices related to observability and telemetry cost optimization
  • Exposure to platform engineering concepts and internal developer platforms (IDPs)
  • Relevant certifications such as AWS/Azure Architect, Google Professional Cloud DevOps Engineer, or Certified Kubernetes Administrator (CKA)
  • Experience defining and measuring operational KPIs tied to business outcomes and service performance