Data Center Operations Engineer
Job in Santa Fe, Santa Fe County, New Mexico, 87503, USA
Listed on 2025-12-22
Listing for: Cadence
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, IT Support, Network Engineer, Cloud Computing
Job Description
At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology.
Job Summary
The Data Center Operations Engineer is responsible for supporting, maintaining, and deploying critical data center infrastructure with a strong focus on Linux-based systems, GPU server deployments, and InfiniBand networking. This role requires hands‑on expertise in data center operations, cluster bring‑up, hardware installation, and troubleshooting across compute, network, and GPU environments. The engineer will collaborate closely with global infrastructure, development, and operations teams to ensure reliable, secure, and scalable service delivery.
- Provide hands‑on operational support for all data center projects, deployments, and repair activities.
- Participate in an on‑call rotation and provide on‑site or remote support during maintenance windows and incidents.
- Troubleshoot and resolve operational issues related to Linux servers, GPU platforms, networking, and storage infrastructure.
- Support customer and internal deployments, ensuring timely and successful bring‑up of GPU servers and clusters.
- Perform InfiniBand fabric bring‑up, switch configuration, subnet management, and troubleshooting.
- Conduct daily health checks of Linux systems and infrastructure components, proactively identifying and mitigating risks.
- Install, configure, test, and maintain server hardware (rack and stack, labeling, HDDs, memory, CPUs, RAID batteries, NICs, etc.).
- Install, configure, and troubleshoot networking equipment including routers, switches, and terminal servers for out‑of‑band management.
- Review and validate equipment deployments against approved design documentation and standards.
- Support data center builds, refreshes, migrations, and expansions while adhering to quality and safety standards.
- Coordinate with vendors and onsite staff for hardware delivery, diagnostics, replacement, and warranty services.
- Use monitoring and alerting frameworks to identify issues, escalate appropriately, and ensure timely service restoration.
- Maintain accurate documentation of operational procedures, system configurations, and runbooks.
- Follow established incident management, escalation procedures, and service‑level agreements (SLAs).
- Collaborate with global teams across time zones to support operational initiatives and continuous improvement efforts.
- Contribute to process improvement initiatives and ensure adherence to documented policies, processes, and procedures.
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
- Strong hands‑on experience in Linux environments, including system administration, troubleshooting, and performance validation.
- Proficiency with Linux command‑line tools and shell scripting (Bash or equivalent).
- Experience with cluster bring‑up, driver installation, and system‑level configuration.
- Hands‑on experience setting up and validating GPU servers in clustered environments.
- Experience with end‑to‑end GPU testing in InfiniBand‑based clusters.
- Working knowledge of InfiniBand networking, including switch configuration and subnet management.
- Solid understanding of networking fundamentals, including the OSI model and TCP/IP protocol suite (IP, ARP, ICMP, TCP, UDP, SMTP, FTP, TFTP).
- Experience installing, configuring, and troubleshooting routers, switches, and terminal servers.
- Familiarity with fiber and copper cabling, including IP and SAN deployments.
- Experience managing incident tickets, maintaining acceptable ticket loads, and meeting SLAs.
- Strong organizational skills with meticulous attention to detail in data center environments.
- Ability to follow and enforce documented escalation procedures and operational policies.
- Strong verbal and written communication skills, with the ability to collaborate effectively with cross‑functional and global teams.
- Experience supporting HPC, AI, or large‑scale GPU environments.
- Exposure to data center monitoring
- Experience documenting operational processes and maintaining…