Sr Edge Platforms SRE - Term Appointment
Evanston, Cook County, Illinois, 60208, USA
Listed on 2025-12-27
IT/Tech
Cloud Computing, Systems Engineer, IT Support
Department: NAISE - NU ANL Inst Sci Eng
Salary/Grade: ITS/82
Job Summary
This is an SRE role focused on maintaining and improving operations of the edge fleet, cloud infrastructure, and data pipeline associated with multiple NSF- and DOE-funded projects. At this time, the projects collectively operate nearly 200 remote edge devices, each running Linux and a local Kubernetes cluster to host user applications. We expect this number to grow by around 300 devices over the next 5 years as part of Sage Grande, our latest NSF-funded project, bringing the fleet to nearly 500 devices.
(See NSF award for more information.)
The incumbent will work closely with the software team to understand the existing design, requirements, and prior issues, and will use that understanding to decide which monitoring tooling to select or build. The incumbent will also work with key collaborators (various universities, national labs, industry partners, tribal partners, and other non-profit organizations) to ensure that their expectations for nodes and data are being met.
Finally, we expect that this role will provide good opportunities for career growth. First, our fleet will continue to grow, so we expect multiple iterations on designing and implementing ideas and new technologies as they become available. Second, we anticipate additional cloud infrastructure and backends as we support more projects. This will provide plenty of time to understand cloud infrastructure and work with the software team to learn useful patterns for instrumentation and monitoring.
Last, the unique nature of our fleet deployment means the incumbent will likely develop software engineering and data analysis skills through implementing novel tooling for addressing issues at scale.
This is a one-year term position. Opportunity for renewal will be based on performance and available funding.
The primary work location is Argonne National Laboratory. This position is primarily on-site, with the possibility of occasional remote work depending on job responsibilities and with management approval. Some travel to other sites is required.
Note: Not all aspects of the job are covered by this job description.
- Addressing software and minor hardware issues in the edge fleet in a timely manner and escalating issues which need attention from the deployment team and/or on-site staff.
- Selecting, developing, and managing tooling and infrastructure for monitoring and alerting. Due to the unique aspects of our edge deployment, it is expected that you will develop substantial software tooling to address gaps that existing tools do not cover.
- Developing relevant dashboards for the software team to understand how well services are performing.
- Performing routine maintenance such as software upgrades and minor tasks such as renewing domain certificates annually.
- Setting up and managing support ticket systems for platform and device issues.
- Leading a small team (1-2 people) of junior SREs as we grow the SRE team.
Perform other duties as assigned.
MINIMUM QUALIFICATIONS (EDUCATION, EXPERIENCE, CERTIFICATIONS, SKILLS)
- Successful completion of a full 4-year course of study in an accredited college or university leading to a bachelor's or higher degree in a major such as computer science, information technology, or a related field; OR an appropriate combination of education and experience.
- 4-5 years of direct experience supporting code, services, and deployments in production.
- Demonstrated experience in Linux, including fundamentals of scripting, user management, networking, package management, SSH, and debugging.
- Experience in software engineering and Python.
- Familiarity with Kubernetes, particularly using it for deployments and deploying and administering clusters.
- Familiarity with monitoring and data collection tooling such as Prometheus, Grafana, Fluent Bit, and Loki.
- Familiarity with basic cybersecurity best practices such as how to securely deploy a web service.
- Strong willingness to learn new tools and technologies on the job.
- Strong communication skills.
- Familiarity with embedded Linux devices such as Raspberry Pi or the NVIDIA Jetson and Orin family.
- Familiarity with basic cloud infrastructure concepts such as time series databases (e.g., InfluxDB), S3 storage, message brokers (e.g., RabbitMQ), caching (e.g., Redis), and web services.
- Familiarity with Infrastructure as Code and configuration management tooling such as Ansible.
- Familiarity with basic data analysis and visualization in Python, with a strong ability to communicate issues using these tools.
- A B.S. or M.S. degree in CS or related fields
- Linux Operating System
- Puppet/Chef/Ansible
- SQL/MySQL/Postgres
- Python
- Shell Scripting
The target hiring range for this position is between $115,000 and $132,750 per year. Offered salary will be determined by the applicant's education, experience, knowledge, skills and abilities, as well as internal…