Senior Reliability Engineer - Technology

Job in New York, New York County, New York, 10261, USA
Listing for: Truelogic Software LLC
Full Time position
Listed on 2025-12-20
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 60000 - 80000 USD Yearly
Job Description & How to Apply Below
Location: New York

About Truelogic

At Truelogic, we are a leading provider of nearshore staff augmentation services headquartered in New York. For over two decades, we’ve been delivering top-tier technology solutions to companies of all sizes, from innovative startups to industry leaders, helping them achieve their digital transformation goals.

Our team of 600+ highly skilled tech professionals, based in Latin America, drives digital disruption by partnering with U.S. companies on their most impactful projects. Whether collaborating with Fortune 500 giants or scaling startups, we deliver results that make a difference.

By applying for this position, you’re taking the first step in joining a dynamic team that values your expertise and aspirations. We aim to align your skills with opportunities that foster exceptional career growth and success while contributing to transformative projects that shape the future.

Our Client

A data-driven technology company that partners with high-growth brands to optimize customer acquisition and retention. It specializes in delivering high-LTV audiences and enrichment data to increase repeat purchase rates. The company collaborates with major platforms and agencies such as Shopify, Experian, TransUnion, and top media partners, all focused on driving profitable revenue growth.

Job Summary

The Site Reliability Engineer plays a key role in operating, observing, and improving the reliability of existing distributed systems running on AWS and Kubernetes, with a strong emphasis on observability, operational maturity, and automated responses to system behavior. Rather than focusing on provisioning infrastructure from scratch, this role concentrates on understanding how services behave in production, detecting when they are not operating correctly, and enabling automated scaling, recovery, and remediation using existing platforms and tooling.

The engineer partners closely with backend and platform teams to evolve observability practices, define reliability signals, and improve how the platform responds to operational and performance concerns, driving overall system resilience and reliability.

Responsibilities
  • Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.

  • Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.

  • Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards rather than basic infrastructure provisioning.

  • Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring they expose meaningful operational signals.

  • Operates and enhances Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring, logging, and tracing stacks.

  • Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.

  • Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.

  • Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.

  • Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.

  • Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.

  • Applies Site Reliability Engineering (SRE) principles—including observability-first design, error budgets, and operational readiness—to shared platform services.

  • Supports IAM roles, secrets management, and tenant isolation best practices.

Qualifications and Job Requirements
  • Has 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.

  • Demonstrates strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and…

Position Requirements
10+ Years work experience