Platform Engineer Job, Austin area, Texas, USA, IT/Tech

* Must be eligible for a Top Secret security clearance

Position Overview

We are seeking a Platform Engineer to lead the operation and reliability of our GPU-based bare-metal Kubernetes infrastructure. In this role, you will own CI/CD systems, maintain high-availability compute environments, and support deployments across both lab and field operations.

Responsibilities

Deploy, manage, and scale bare-metal Kubernetes clusters supporting NVIDIA GPUs, with hybrid cloud bursting to AWS for elastic compute and storage workloads.
Operate and optimize NVIDIA GPU infrastructure for machine learning training and inference workloads.
Own the end-to-end CI/CD lifecycle, including build automation, artifact management, signing, version pinning, and repeatable deployments across cloud and edge environments.
Design and maintain observability systems, including centralized logging, metrics collection, dashboards, and alerting to ensure real-time visibility into infrastructure and application health.
Partner with robotics, computer vision, and software engineering teams to develop streamlined developer tooling and improve engineering velocity for our platform.
Implement and maintain infrastructure-as-code standards using tools such as Terraform, Helm, and Ansible across on-premises and cloud deployments.
Manage networking, storage, cluster security, and system hardening for production-grade bare-metal environments in accordance with applicable defense and security requirements.

Qualifications

Strong Python and Bash scripting skills.
5+ years of experience in Platform Engineering, DevOps, Site Reliability Engineering, or Infrastructure Engineering roles supporting production Kubernetes environments.
Deep expertise administering bare-metal Kubernetes clusters, including cluster lifecycle management, CNI networking, storage backends, node operations, and upgrades.
Hands‑on experience with NVIDIA GPU infrastructure, including CUDA, Kubernetes GPU scheduling, NVIDIA device plugins, and ML orchestration platforms such as Kubeflow.
Strong experience building and maintaining CI/CD systems using tools such as GitLab CI, GitHub Actions, Jenkins, or similar platforms.
Experience with observability and monitoring stacks for distributed Linux systems, including centralized logging, metrics, and alerting platforms (e.g., ELK/OpenSearch, Prometheus, Grafana).
Experience building and maintaining Linux-based C++ and Python toolchains using CMake, including cross-compilation for ARM-based platforms such as NVIDIA Jetson.
Strong Linux systems administration experience (Debian/Ubuntu preferred), including networking, storage management, kernel tuning, and security hardening in production environments.
