System Software Engineer Job, Carlsbad area, New Mexico, USA, IT/Tech

Hoonify delivers secure, sovereign AI infrastructure designed for the next generation of inference workloads. Powered by TurbOS®, our platform enables organizations and Neo Cloud/data center operators to transform CPU/GPU infrastructure into production-ready AI environments—supporting local LLMs, agentic copilots, RAG, and embeddings. We empower teams with robust model lifecycle management, multi‑tenant controls, usage metering, and fully auditable operations.

The Role

We are seeking a System Software Engineer to help build, deploy, and operate the multi‑cloud computational platform and model‑serving infrastructure that underpin our AI/ML developer platform. This role focuses on implementation, automation, and day‑to‑day operation of production systems, working under the technical direction of senior engineers and within the platform's established architectural patterns.

The successful candidate will deliver well‑engineered, well‑tested infrastructure changes and grow their depth across Kubernetes, GPU‑backed workloads, observability, and continuous delivery in a production environment.

This role enables meaningful growth in cloud infrastructure, distributed systems, and ML serving. You will work directly with senior engineers on real production systems, receive code and design review on your work, and have a clear path to expand scope and ownership as your experience deepens.

Core Responsibilities

Implement and maintain Kubernetes workloads and supporting resources, including manifests, Helm charts, controllers, and configuration for networking, ingress, and storage, following established platform patterns.
Deploy and operate model‑serving workloads on GPU and accelerator node pools, including configuring autoscaling policies, resource requests and limits, and tenant‑specific deployment configurations.
Support model training and simulation workloads on distributed GPU systems.
Build and maintain instrumentation on Prometheus, Grafana, and OpenTelemetry, including authoring dashboards, alerting rules, and trace and metric instrumentation for new services.
Implement and improve CI/CD pipelines, including build, test, and deployment automation, and contribute to progressive delivery practices already in use on the platform.
Develop and maintain infrastructure‑as‑code modules and automation scripts in support of repeatable, auditable infrastructure changes across cloud environments.
Support response to production incidents, execute documented runbooks, and contribute to postmortems and follow‑up remediation work.
Investigate and resolve issues across the stack, including container, node, network, and accelerator‑level problems, escalating appropriately when scope exceeds the role.
Write clear documentation, including runbooks, internal references, and design notes for the changes you ship.
Participate in code and design reviews, both as author and reviewer, and incorporate feedback from senior engineers into your work.

Required Qualifications

Bachelor's degree in Computer Science, Computer Engineering, or Information Technology, plus three (3) years of relevant work experience, or an equivalent combination of education and relevant experience.
Professional experience in cloud infrastructure, DevOps, site reliability, or backend engineering roles involving production system operation.
Working knowledge of Kubernetes in a production context, including writing and debugging manifests, understanding core resource types, and operating production workloads.
Hands‑on experience with at least one major cloud provider (e.g. AWS, GCP, or OpenStack), including its compute, networking, and identity primitives.
Experience instrumenting services and consuming observability data, including writing Prometheus queries, building Grafana dashboards, or working with distributed traces.
Familiarity with CI/CD systems and the basic mechanics of automated build, test, and deployment pipelines.
Experience with configuration management and infrastructure‑as‑code tools (e.g. Ansible, Puppet, or Helm).
Proficiency in at least one programming or scripting language used for infrastructure work (Python, Go, Rust, or Bash).
Comfort working in a Linux…