Cloud Reliability Engineer Job, East Orange area, New Jersey, USA, IT/Tech

VERSANT (Nasdaq: VSNT) is an industry-changing media and entertainment business and home to trusted brands that shape culture, inform audiences, and build lasting connections. It operates across four core markets: political news and opinion, business news and personal finance, golf, and sports and genre entertainment. These markets are served through a powerful portfolio of iconic and innovative brands, including CNBC, MS NOW, USA Network, Golf Channel, Oxygen, E!, SYFY, and Versant’s sports division USA Sports, along with complementary digital assets including Fandango, Rotten Tomatoes, Golf Now and Golf Pass.

The Cloud Reliability Engineer is responsible for ensuring the availability, performance, scalability, and operational excellence of VERSANT’s cloud platforms and services.

This role works closely with cloud engineering, application development, networking, security, and operations teams to build and maintain highly reliable systems across a large multi‑account AWS environment. The engineer will leverage automation, observability, and reliability engineering practices to improve platform resilience, reduce operational risk, and enhance the customer experience.

As a leading media company, VERSANT operates digital products, streaming platforms, content delivery systems, and media workflows that demand high levels of uptime and performance. The Cloud Reliability Engineer will help ensure these services remain resilient, scalable, and operationally mature.

The ideal candidate has strong experience with AWS, monitoring and observability platforms, incident management, automation, infrastructure as code, and operational best practices. Experience with AWS Organizations, Control Tower, Identity Center, Terraform, and modern cloud operations tooling is highly desirable.

Responsibilities

Reliability Engineering
- Design, implement, and maintain reliability practices for cloud infrastructure and platform services.
- Define and monitor service‑level objectives (SLOs), service‑level indicators (SLIs), and operational metrics.
- Identify reliability risks and implement solutions that improve availability, scalability, and resilience.
- Drive continuous improvement initiatives focused on operational excellence and system stability.
Monitoring, Observability & Performance
- Design and maintain monitoring, logging, alerting, and observability solutions across AWS environments.
- Develop dashboards and reporting that provide visibility into platform health and performance.
- Analyze system behavior, identify bottlenecks, and implement performance improvements.
- Establish proactive monitoring practices that detect issues before they impact customers.
Incident Response & Operational Excellence
- Participate in incident response, troubleshooting, and root cause analysis activities.
- Lead post‑incident reviews and identify corrective actions to prevent recurrence.
- Improve operational processes, runbooks, and recovery procedures.
- Support disaster recovery and business continuity initiatives.
AWS Platform Reliability
- Support the reliability and operational health of large‑scale AWS environments utilizing AWS Organizations, Control Tower, and Identity Center.
- Partner with cloud engineering teams to improve platform architecture, resiliency, and operational consistency.
- Assist in maintaining secure, scalable, and highly available cloud services.
Automation & Infrastructure as Code
- Develop automation that reduces operational toil and improves system reliability.
- Support infrastructure‑as‑code solutions using Terraform, AWS CloudFormation, and related technologies.
- Automate operational workflows, monitoring, remediation, and recovery activities.
- Contribute to CI/CD pipelines and deployment automation initiatives.
Media & Digital Platform Reliability
- Support the reliability of streaming platforms, content delivery systems, media workflows, APIs, and customer‑facing applications.
- Collaborate with engineering teams to improve application reliability and operational readiness.
- Assist in capacity planning and scaling efforts for high‑traffic events and media workloads.
Collaboration & Continuous Improvement
- Partner with cloud, networking, security, and application…