Sr. Manufacturing Site Reliability Engineer, Windows Platform
Job in Austin, Travis County, Texas, 78719, USA
Listed on 2026-06-06
Listing for: Tesla
Full Time position
Job specializations:
* IT/Tech
* Systems Engineer, IT Support
Job Description
Tesla's Manufacturing SRE team owns the underlying platforms that keep production lines moving across Fremont, Sparks, Austin, Berlin, and Shanghai. The fleet spans tens of thousands of Linux and Windows hosts driving Station Controllers, Industrial PCs (IPCs), camera systems, robot controllers, and label printers across every shop on every line.
The Controls organization is shifting more workloads back to Windows, and the Optimus program is on track to run more Windows computers than Linux computers in production. Today, one engineer carries the Windows specialty for the entire team. This role exists to change that: to deepen Windows expertise on MFGSRE so the platform can absorb the Optimus rollout, IPC fleet growth, and steadily increasing Windows software complexity per host without losing reliability.
You partner with Controls Engineering, IT Manufacturing Operations, and the Optimus team to make sure Windows hosts are booted, imaged, monitored, recovered, and patched the same way Linux hosts already are, and you contribute back to the broader MFGSRE platform (TFI imaging, WINFinder inventory, TFO observability) so the wins compound.
What You'll Do
* Own the Windows production fleet end to end: Industrial PCs, Tangents, MTE benches, Optimus dyno PCs, GA station HMIs, and the long tail of factory Windows hosts across all sites
* Drive Windows imaging through the TFI PXE pipeline (Ansible, Jenkins, Artifactory) so a new IPC can boot, join the domain, and report green telemetry without manual intervention
* Extend WINFinder, the Windows host inventory and management service, to cover new platforms and new sites; co-maintain Windows LAPS rotation, AD lifecycle, and the ITMFGAgent runtime
* Push the Windows fleet onto the same observability surface as Linux: Grafana Alloy and the Tesla Metrics Agent (TMA) collecting metrics, structured logs into Splunk MFGSRE indexes, alerting in Opsgenie or JSM
* Build automation for the Windows-specific operational pain that does not exist on Linux: GPO drift, driver and firmware management, Windows Update maintenance windows, NTFS permission audits, time zone enforcement across sites, certificate rotation for OPC-UA and ACR
* Roll out and sustain the SentinelOne agent across factory Windows endpoints, partnering with Infosec on detections and exclusions tuned for production hardware
* Own the deployment story for Windows-resident Tesla applications (Print App configurator, ZCP, NX Witness, station controllers) when they touch Windows boxes
* Join the production on-call rotation for Windows incidents: triage P1 line-down events, write the runbook, file the Jira ticket, drive the post-mortem, and turn the fix into automation
* Contribute to the cross-platform tools (TFI, TFD, TFC, TFO, ITMFG/itmfg-windows, ITMFG/print-app) that MFGSRE owns; submit PRs in Go, Python, PowerShell, or Ansible as the work demands
What You'll Bring
* 5+ years operating Windows in production at scale, including Active Directory, GPO, Windows Server, and Windows 10 or 11 LTSC on industrial hardware
* Strong PowerShell skills: scripts that call the Win32 API, parse event logs, drive WMI, and integrate with REST endpoints, not one-liners
* Hands-on experience with at least one configuration management or imaging platform: Ansible, Intune, SCCM, MDT, Puppet, or equivalent custom PXE work
* SRE practice fundamentals: SLO design, alert hygiene, runbook discipline, blameless post-mortem authorship, error-budget thinking
* Working knowledge of Linux as a peer platform; you do not need to be a kernel hacker, but you can read a systemd unit, write a bash one-liner, and submit a clean Ansible PR
* Comfort writing application code in at least one of Python, Go, or C#, enough to ship a small service or extend an existing one
* Production experience with observability tooling: Prometheus or Grafana, Splunk or an equivalent log platform, and OpenTelemetry concepts
* A bias for automating away repeated work, even at the cost of more upfront engineering effort
* Direct, low-ego communication style; comfort working asynchronously across Sparks, Fremont, Austin, and Berlin
Compensation and Benefits
Benefits
Along with competitive pay, as a…