HPC Engineer
Job in Austin, Travis County, Texas, 78716, USA
Listed on 2026-05-18
Listing for: Arm
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, SRE/Site Reliability
Job Description & How to Apply Below
Job Overview
Engineering IT provides the high-performance compute platforms that enable Arm’s engineering teams to design, verify, and deliver world-class products. The team operates a mix of on‑premises and cloud‑based HPC environments, EDA enablement services, job scheduling platforms, automation tooling, and custom workflows that are critical to engineering productivity across Arm.
We are looking for an HPC Operations Engineer to help run, improve, and modernize these services. This role combines production operations, site reliability engineering, automation, cloud integration, and close collaboration with engineering users and infrastructure teams.
Responsibilities
- Operate, support, and continuously improve Arm’s HPC platforms, with a strong focus on IBM Spectrum LSF and related job scheduling services.
- Improve reliability, scalability, performance, and operational efficiency through automation, observability, standardization, and SRE practices.
- Develop automation and self‑service capabilities to reduce manual operational effort and improve the user experience.
- Support production HPC environments, including incident response, resolution, root cause analysis, service restoration, and continuous improvement.
- Work directly with engineering users to improve job scheduling behavior, workload performance, resource utilization, and platform efficiency.
- Develop and maintain scripts, tools, and automation frameworks using Python, Bash, and related technologies.
- Support modernization initiatives involving containers, Kubernetes, Docker, cloud‑native services, Infrastructure as Code, and alternative scheduling or orchestration technologies.
- Contribute to cloud HPC integration across AWS, GCP, Azure, OpenStack, and hybrid environments.
- Collaborate with platform, cloud, storage, infrastructure, networking, and security teams to deliver robust engineering services.
- Contribute to project delivery by working with technical leads, architects, project managers, and operational team members.
- Help define and promote standards for DevOps, SRE, platform engineering, CI/CD, monitoring, and infrastructure automation.
Required Skills and Experience
- Experience operating HPC environments and job schedulers such as IBM Spectrum LSF, Slurm, PBS, Grid Engine, or similar.
- Strong Linux system administration experience, preferably with RHEL or RHEL‑based distributions.
- Good scripting and automation skills using Python, Bash, Shell, or similar languages.
- Experience supporting production infrastructure, including incident management, resolution, operational recovery, and root cause analysis (RCA), or comparable experience.
- Familiarity with monitoring, alerting, and observability platforms such as Dynatrace, Prometheus, Grafana, or similar.
- Experience building, maintaining, or supporting CI/CD pipelines and automation frameworks.
- Experience with public, private, or hybrid cloud platforms, including AWS, GCP, Azure, OpenStack, and Kubernetes‑based services.
- Understanding of DevOps, SRE, platform engineering, infrastructure automation, and operational excellence principles.
- Familiarity with Agile delivery practices and collaboration tools such as Jira and Confluence.
- Ability to work with engineering users, understand workload requirements, and translate operational issues into practical improvements.
Nice to Have Skills and Experience
- Experience working in EDA or semiconductor engineering environments.
- Familiarity with EDA tools, license‑aware scheduling, large‑scale batch workloads, and engineering compute workflows.
- Exposure to container platforms and orchestration technologies such as Docker, Kubernetes, and Kubernetes‑native scheduling.
- Experience with Infrastructure as Code tools such as Terraform and Ansible.
- Exposure to alternative schedulers such as Slurm or cloud‑native workload orchestration systems.
- Experience using AI‑assisted tooling, MCP, agentic services, or automation agents to improve diagnostics, operations, optimization, or self‑service support.
- Experience operating large‑scale distributed systems across both on‑premises and cloud infrastructure.
$130,100-$176,000 per year
Accommodations at Arm
At Arm, we want to build extraordinary teams. If you need…