Site Reliability Engineer (SRE) Job, Salem area, Oregon, USA, IT/Tech

Position: Site Reliability Engineer (SRE)

Hiring: W2 Candidates Only

Visa: Open to any visa type with valid work authorization in the USA

Summary

A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of software systems and infrastructure. This role bridges the gap between development and operations by applying software engineering principles to IT operations, automating processes, and monitoring system health to prevent downtime and improve system efficiency.

Key Responsibilities

Design, implement, and maintain reliable, scalable, and highly available infrastructure and services.
Monitor system performance, availability, and capacity; respond proactively to incidents and outages.
Develop and maintain automation tools for deployment, monitoring, and infrastructure management.
Collaborate with software engineers to design systems with reliability and maintainability in mind.
Troubleshoot, debug, and resolve complex production issues across multiple systems and services.
Implement and maintain CI/CD pipelines, configuration management, and version control best practices.
Conduct post-incident reviews, identify root causes, and implement corrective actions to prevent recurrence.
Define and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
Optimize system performance, cost, and resource utilization through analysis and continuous improvement.
Document infrastructure, operational procedures, incident reports, and monitoring configurations.
Mentor junior engineers and promote best practices for reliability, automation, and observability.
Stay current with emerging technologies and DevOps practices to improve operational excellence.

Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field.
3-6 years of experience in site reliability engineering, DevOps, or system administration.
Strong understanding of Linux/Unix systems, networking, and cloud platforms (AWS, Azure, Google Cloud Platform).
Proficiency in scripting and programming languages such as Python, Bash, Go, or Java.
Experience with monitoring, logging, and observability tools (Prometheus, Grafana, ELK Stack).
Familiarity with containerization and orchestration tools (Docker, Kubernetes).

Preferred Skills / Duties

Experience with Infrastructure as Code (Terraform, Ansible, CloudFormation).
Knowledge of CI/CD tools and pipelines (Jenkins, GitLab, CircleCI).
Understanding of distributed systems, microservices architecture, and high-availability systems.
Strong problem-solving, analytical, and communication skills.
Ability to implement security best practices in operational environments.
Experience in automating repetitive operational tasks and improving system reliability.
