Platform Engineer
Job in
Austin, Travis County, Texas, 78716, USA
Listed on 2026-05-16
Listing for:
Green Key Resources
Full Time position
Job specializations:
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Job Description & How to Apply Below
* Must be eligible for Top Security Clearance
We are seeking a Platform Engineer to lead the operation and reliability of our GPU-based bare-metal Kubernetes infrastructure. In this role, you will own CI/CD systems, maintain high-availability compute environments, and support deployments across both lab and field operations.
Responsibilities
- Deploy, manage, and scale bare-metal Kubernetes clusters supporting NVIDIA GPUs, with hybrid cloud bursting to AWS for elastic compute and storage workloads.
- Operate and optimize NVIDIA GPU infrastructure for machine learning training and inference workloads.
- Own the end-to-end CI/CD lifecycle, including build automation, artifact management, signing, version pinning, and repeatable deployments across cloud and edge environments.
- Design and maintain observability systems, including centralized logging, metrics collection, dashboards, and alerting to ensure real-time visibility into infrastructure and application health.
- Partner with robotics, computer vision, and software engineering teams to develop streamlined developer tooling and improve engineering velocity for our platform.
- Implement and maintain infrastructure-as-code standards using tools such as Terraform, Helm, and Ansible across on-premises and cloud deployments.
- Manage networking, storage, cluster security, and system hardening for production-grade bare-metal environments in accordance with applicable defense and security requirements.
Qualifications
- Strong Python and Bash scripting skills.
- 5+ years of experience in Platform Engineering, DevOps, Site Reliability Engineering, or Infrastructure Engineering roles supporting production Kubernetes environments.
- Deep expertise administering bare-metal Kubernetes clusters, including cluster lifecycle management, CNI networking, storage backends, node operations, and upgrades.
- Hands‑on experience with NVIDIA GPU infrastructure, including CUDA, Kubernetes GPU scheduling, NVIDIA device plugins, and ML orchestration platforms such as Kubeflow.
- Strong experience building and maintaining CI/CD systems using tools such as GitLab CI, GitHub Actions, Jenkins, or similar platforms.
- Experience with observability and monitoring stacks for distributed Linux systems, including centralized logging, metrics, and alerting platforms (e.g., ELK/OpenSearch, Prometheus, Grafana).
- Experience building and maintaining Linux-based C++ and Python toolchains using CMake, including cross‑compilation for ARM-based platforms such as NVIDIA Jetson.
- Strong Linux systems administration experience (Debian/Ubuntu preferred), including networking, storage management, kernel tuning, and security hardening in production environments.