Site Reliability Engineer (SRE)
Job in Salem, Marion County, Oregon, 97308, USA
Listed on 2026-02-07
Listing for: INNOVIT USA INC
Full Time position
Job specializations:
- IT/Tech
Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Hiring: W2 Candidates Only
Visa: Open to any visa type with valid work authorization in the USA
Summary
A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of software systems and infrastructure. This role bridges the gap between development and operations by applying software engineering principles to IT operations, automating processes, and monitoring system health to prevent downtime and improve system efficiency.
Key Responsibilities
- Design, implement, and maintain reliable, scalable, and highly available infrastructure and services.
- Monitor system performance, availability, and capacity; respond proactively to incidents and outages.
- Develop and maintain automation tools for deployment, monitoring, and infrastructure management.
- Collaborate with software engineers to design systems with reliability and maintainability in mind.
- Troubleshoot, debug, and resolve complex production issues across multiple systems and services.
- Implement and maintain CI/CD pipelines, configuration management, and version control best practices.
- Conduct post-incident reviews, identify root causes, and implement corrective actions to prevent recurrence.
- Define and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
- Optimize system performance, cost, and resource utilization through analysis and continuous improvement.
- Document infrastructure, operational procedures, incident reports, and monitoring configurations.
- Mentor junior engineers and promote best practices for reliability, automation, and observability.
- Stay current with emerging technologies and DevOps practices to improve operational excellence.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 3-6 years of experience in site reliability engineering, DevOps, or system administration.
- Strong understanding of Linux/Unix systems, networking, and cloud platforms (AWS, Azure, Google Cloud Platform).
- Proficiency in scripting and programming languages such as Python, Bash, Go, or Java.
- Experience with monitoring, logging, and observability tools (Prometheus, Grafana, ELK Stack).
- Familiarity with containerization and orchestration tools (Docker, Kubernetes).
- Experience with Infrastructure as Code (Terraform, Ansible, CloudFormation).
- Knowledge of CI/CD tools and pipelines (Jenkins, GitLab, CircleCI).
- Understanding of distributed systems, microservices architecture, and high-availability systems.
- Strong problem-solving, analytical, and communication skills.
- Ability to implement security best practices in operational environments.
- Experience in automating repetitive operational tasks and improving system reliability.