SRE / DevOps Job, New York, New York, USA (IT/Tech)

Position: Application SRE (DevOps)
Location: New York

ELLKAY started out providing connectivity solutions to laboratories and, within a few years, grew to also provide data management solutions to ambulatory organizations. ELLKAY is now a trusted data management partner in five healthcare segments. ELLKAY's solutions continue to serve laboratories and ambulatory practices and have expanded to empower hospitals and health systems, healthcare IT vendors, health plans, and other healthcare organizations with cutting-edge technologies and solutions that drive their growth and interoperability strategies.

Today, ELLKAY remains true to our core values, building strong partner relationships and offering unparalleled service and support while providing innovative, scalable solutions to the challenges our customers face in today's data-rich world.

ELLKAY's experience, customer-focused approach, and reputation for innovation, speed, and accuracy differentiate ELLKAY as a premier partner for your interoperability needs and data management strategy.

Job Description

We are looking for an Application Site Reliability Engineer (SRE) with strong DevOps experience to improve the reliability, scalability, and performance of our applications.

The Application Site Reliability Engineer will serve as a technical contact responsible for driving the reliability, performance, and operational maturity of our application ecosystem. This role works across multiple teams to support scalable systems, establish reliability standards, improve observability, and implement automation that reduces operational effort. The SRE will lead complex incident responses, guide engineering teams on best practices, and influence architectural decisions to ensure resilient, high-quality software delivery.

You will help define reliability standards, reduce operational toil, and ensure smooth production operations while enabling faster and safer releases.

Own application reliability, availability, performance, and scalability in production and non-production environments
Design, build, and maintain CI/CD pipelines for application deployments
Automate infrastructure provisioning and configuration using Infrastructure as Code
Monitor application health using metrics, logs, and traces; define SLIs, SLOs, and error budgets
Lead incident response and root-cause analysis (RCA), ensuring corrective and preventive actions are completed and communicated
Improve system resilience through capacity planning, system tuning, and fault tolerance
Partner with development teams to ensure services meet reliability, performance, and scalability objectives
Reduce manual operational effort through automation and self-healing solutions
Serve as a point of contact for critical Sev1/Sev2 incidents, leading incident command when required

Qualifications

Strong experience as an SRE, DevOps Engineer, or Production Support Engineer
Solid understanding of Windows and Linux/Unix systems and networking fundamentals
7 years of experience as an SRE
Hands-on experience with cloud platforms such as AWS, Azure, or GCP
Experience with containerization and orchestration tools like Docker and Kubernetes
Proficiency in CI/CD tools such as Jenkins, GitHub Actions, or similar
Experience with Infrastructure as Code tools like Terraform, CloudFormation, or ARM
Strong scripting skills in Python, Bash, or similar languages
Experience with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, etc.)
Understanding of reliability concepts such as SLAs, SLOs, and incident management

Preferred Qualifications

Experience supporting microservices-based architectures
Knowledge of security best practices in cloud and DevOps environments
Experience with configuration management tools (Ansible, Chef, or Puppet)
Exposure to chaos engineering or resilience testing practices

Soft Skills

Strong problem-solving and troubleshooting skills
Ability to work calmly during incidents and high-pressure situations
Clear communication and collaboration with cross-functional teams
Ownership mindset with a focus on continuous improvement

What We Offer

Opportunity to work on highly…