Cloud Support Engineer - Managed Cloud Services (San Jose area, California, USA; IT/Tech)

Job Summary

We are seeking a highly motivated candidate for the position of Cloud Support Engineer with a strong infrastructure background to support our secure, cloud‑based silicon chip design environments used by external customers for mission‑critical EDA, HPC, and containerized workloads. This role is customer‑facing and service‑oriented, requiring deep technical expertise across Linux, cloud infrastructure, and platform operations, along with a strong commitment to responsiveness, professionalism, and delivering an exceptional customer experience.

Key Responsibilities

Customer Support & Service Excellence – Serve as a primary technical support contact for external customers using secure cloud‑based silicon design and HPC platforms. Deliver timely, responsive, and high‑quality support, ensuring customer issues are acknowledged, communicated, and resolved effectively. Proactively minimize downtime, anticipate customer needs, and resolve issues before they impact workloads. Clearly communicate complex technical issues, status updates, and resolutions to customers with varying levels of expertise. Build long‑term customer trust through professionalism, ownership, and consistent follow‑through.
Platform, Infrastructure & Environment Support – Support and troubleshoot Linux‑based infrastructure and cloud environments, including compute, storage, networking, and identity components. Operate and support OpenStack‑based private or hybrid cloud platforms, including core services (Nova, Neutron, Cinder, Glance, Keystone, etc.). Support OpenShift / Kubernetes platforms, including cluster operations, workload troubleshooting, networking, storage integration, and upgrades. Maintain availability, performance, and reliability of secure multi‑tenant environments. Perform system‑level diagnosis across infrastructure layers to identify root cause and remediation paths. Partner with internal platform and engineering teams to drive stability and performance improvements.
HPC, Licensing & Performance Management – Monitor HPC cluster performance, job scheduling, throughput, and queue health. Identify and resolve HPC job performance issues, including scheduler configuration, resource contention, I/O bottlenecks, and memory constraints. Troubleshoot and resolve license availability, utilization, and checkout issues impacting customer workloads. Support distributed resource managers such as Slurm, LSF, SGE, or equivalent schedulers.
Automation & Operational Efficiency – Design, develop, and maintain automation for recurring operational tasks, including: infrastructure and platform health monitoring; capacity tracking and alerting; user provisioning and de‑provisioning; license usage monitoring; detection of abnormal system, container, or job behavior. Use Python, shell scripting, Perl, or similar tools to reduce manual effort and improve mean time to resolution (MTTR). Apply AI‑assisted or agentic automation where appropriate to improve operational efficiency and customer experience.
Security, Compliance & Operations – Operate and support systems containing ITAR‑controlled and CUI data in compliance with regulatory and corporate requirements. Follow documented security, access control, auditing, and change management procedures. Participate in incident response, post‑incident root cause analysis, and corrective action planning. Create and maintain runbooks, knowledge base articles, and customer‑facing documentation.

Required Qualifications

Strong hands‑on experience with Linux system administration and troubleshooting.
Broad infrastructure experience, including compute, storage, networking, and identity services.
Experience operating and supporting OpenStack and/or OpenShift (Kubernetes) environments.
Experience supporting HPC or large‑scale compute environments.
Proficiency in Python, shell scripting, Perl, or similar automation‑focused languages.
Experience with monitoring, logging, and alerting platforms.
Familiarity with license management systems such as FlexNet / FLEXlm or equivalent.
Demonstrated ability to deliver excellent customer service in a technical support, SRE, or infrastructure operations role.
Strong…