AI Systems Administrator
Listed on 2026-06-18
IT/Tech
Unix/Linux, Cybersecurity, IT Support, Cloud Computing: Infrastructure & Operations
Overview
Draper is an independent, nonprofit research and development company headquartered in Cambridge, MA. The 2,000+ employees of Draper tackle important national challenges with a promise of delivering successful and usable solutions. From military defense and space exploration to biomedical engineering, lives often depend on the solutions we provide. Our multidisciplinary teams of engineers and scientists work in a collaborative environment that inspires the cross‑fertilization of ideas necessary for true innovation.
Job Description Summary
The AI Systems Administrator is instrumental in bringing AI to Draper. The incumbent implements a closed GPT environment at Draper in which several different LLMs are maintained and used throughout the organization. This role works with engineering to ensure that multiple LLMs are accessible through a chat interface, an API, and assistive tools for general use across the organization.
In addition, the administrator will maintain the health of the DraperGPT server so that additional compute‑intensive AI infrastructure can run without degrading the performance of other LLM resources. This will include API integrations with various software platforms across Draper (e.g., engineering, accounting, legal). The role helps Draper implement automation, streamline processes, and support mission‑critical AI/ML workloads.
Resource allocation is critical.
The position also involves traditional Linux administration duties (installing, configuring, and securing servers; scripting; monitoring) with a strong focus on supporting AI/ML (e.g., GPU servers, Kubernetes, data pipelines). It includes providing guidance to AI engineers and is part of a team of Linux system administrators managing approximately 750 computers, primarily running Oracle Linux. Knowledge of additional operating systems (Ubuntu, RHEL) may be necessary. Responsibilities include maintaining security, serving as a front‑line interface to end users, recommending hardware and software purchases, interacting with vendors, and training other administrators.
Hybrid (3 days/week) in Cambridge, MA. Requires an Active Secret Clearance.
Duties/Responsibilities
- Build, operate, and troubleshoot RHEL/Oracle systems supporting GPU workloads (OS lifecycle, patching, performance, reliability).
- Manage the GPU enablement layer: driver and toolkit lifecycle, kernel/driver compatibility, coordinated upgrades and rollback plans, and ongoing health monitoring.
- Implement and maintain observability (metrics, logs, alerting) for system, GPU, and storage performance/health (e.g., Prometheus/Grafana, GPU telemetry such as DCGM/NVML).
- Couple observability with LLM performance and usage data, and identify and warn users who are over‑allocating resources.
- Maintain LLM servers (resetting or rebuilding) to ensure high uptime and usage capabilities across the organization.
- Work with engineers to allow software upgrades (new models, additional AI software) while maintaining security needs.
- Partner with storage/network peers to baseline throughput/latency, identify bottlenecks, and tune the platform for predictable performance.
- Automate and script platform administration and broader Linux team workflows (provisioning, configuration enforcement, patch orchestration, reporting, routine maintenance) using Git‑based practices (Python/Ansible).
- Support various Linux and cloud (AWS/Azure) projects.
- Lead projects including large‑scale migrations, platform redesign, and implementation, utilizing resources within the Linux team and across the IS department.
- Strong production Linux administration experience (RHEL/Oracle preferred): systemd, networking, troubleshooting, performance analysis, patching, package management.
- Strong automation skills: Bash and/or Python, plus Ansible (preferred) or equivalent configuration management; comfortable with CI/Git workflows.
- Experience supporting enterprise platforms (incident response, root‑cause analysis, post‑mortems, runbooks/documentation).
- Security‑minded operations in regulated environments; familiarity with CUI handling concepts and control expectations (audit logging, vulnerability remediation, change control).