Senior Systems Software Engineer, AI Infrastructure
Job in
Coos Bay, Coos County, Oregon, 97458, USA
Listed on 2026-02-24
Listing for:
NVIDIA
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Why consider this job opportunity
- Base salary range of $152,000 - $287,500, depending on level and experience.
- Eligibility for equity and comprehensive benefits package.
- Opportunity for career advancement and growth within a leading technology company.
- Work in a diverse and supportive environment that values innovation and creativity.
- Chance to contribute to groundbreaking AI Infrastructure projects that shape the future of computing.
Responsibilities
- Develop and maintain large-scale systems for AI Infrastructure, ensuring reliability, operability, and scalability.
- Collaborate on tooling for HPC, GPU Training, and AI Model training workflows.
- Build tools and frameworks to enhance observability and improve system performance.
- Implement SRE fundamentals, including incident management and performance optimization.
- Work with engineering teams to deliver innovative solutions and uphold high standards for code and infrastructure.
Qualifications
- Degree in Computer Science or a related field, or equivalent experience, with 5+ years in Software Development, SRE, or Production Engineering.
- Proficiency in Python and at least one other programming language (C/C++, Go, Perl, Ruby).
- Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, Azure, GCP, or OCI).
- Strong understanding of SRE principles, including error budgets, SLOs, SLAs, and Infrastructure as Code tools.
- Hands‑on experience with observability platforms and CI/CD systems.
- Experience in AI training, inferencing, and data infrastructure services.
- Proficiency in deep learning frameworks like PyTorch, TensorFlow, JAX, and Ray.
- Strong background in cloud or hardware health monitoring and system reliability.
- Hands‑on expertise in operating and scaling distributed systems with stringent SLAs.
- Knowledge of incident, change, and problem management processes.
We prioritize candidate privacy and champion equal‑opportunity employment. Central to our mission is our partnership with companies that share this commitment. We aim to foster a fair, transparent, and secure hiring environment for all. If you encounter any employer not adhering to these principles, please bring it to our attention immediately.
We are not the EOR (Employer of Record) for this position. Our role in this specific opportunity is to connect outstanding candidates with a top‑tier employer.
Position Requirements
10+ years of work experience