Principal Systems/Software Administrator
Vancouver, BC, Canada
Listed on 2026-03-10
IT/Tech
Cloud Computing, SRE/Site Reliability, Systems Engineer
This role has been designated as ‘Remote/Teleworker’, which means you will primarily work from home.
Who We Are:
Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next.
We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career, our culture will embrace you. Open up opportunities with HPE.
HPE / Mist is seeking a Principal Systems/Software Engineer (SRE) to join our cloud infrastructure team. In this role, you will support and scale highly available SaaS platforms powered by AI-driven cloud technologies.
You will play a critical role in maintaining production stability, improving reliability, and enabling rapid growth across multi-cloud environments (AWS and GCP). Your primary focus will be incident management, release management, and operational excellence for large-scale distributed systems.
Key Responsibilities
- Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP, meeting defined SLAs and SLOs
- Operate and support infrastructure components, data streaming platforms, and databases, including:
Kubernetes, Kafka, Flink, Storm, Spark;
Cassandra, Elasticsearch, Redis, Postgres, ArangoDB, and related technologies
- Monitor, troubleshoot, and resolve production issues across microservices and distributed systems
- Partner closely with software engineering teams to debug and resolve complex production incidents
- Participate in a 24x7 on‑call rotation supporting a multi‑cloud environment
- Monitor system metrics, application performance, and infrastructure health
- Own the full incident management lifecycle, including detection, mitigation, RCA creation, and post‑incident reviews
- Develop, maintain, and improve runbooks and automated operational processes
- Perform capacity planning using performance, usage, and utilization data
- Apply and promote SRE best practices, operational standards, and continuous improvement initiatives
- Bachelor’s degree in Computer Science, Computer Engineering, or equivalent practical experience
- 10+ years of overall DevOps / Site Reliability Engineering experience
- 7+ years of hands‑on experience with cloud platforms such as AWS or GCP, including:
Compute (EC2 / GCE), IAM, object storage (S3 / GCS);
Docker, Kubernetes (pods and clusters); CI/CD tools such as Jenkins;
Monitoring and observability tools (Prometheus, CloudWatch, Stackdriver);
Linux‑based systems and configuration management (Ansible)
- 7+ years of experience deploying and managing production workloads using CI/CD pipelines in AWS or GCP environments
- 5+ years of administration experience with distributed systems and streaming platforms, including Kafka, Cassandra, Elasticsearch, Spark, Flink, Storm, and cloud services such as EMR, Dataproc, ElastiCache, AWS RDS, or GCP SQL
- 5+ years of automation experience using Python, Go, and/or Rust, plus shell scripting
- 5+ years of experience designing and implementing metrics to monitor infrastructure and application health
- Working knowledge of Infrastructure as Code (Terraform, CloudFormation, or equivalent)
- Open‑source software contributions
- Experience with AIOps or Generative AI technologies
- Workflow and automation experience using GitHub Actions, Google Workflows, Jenkins, GitLab, Slack, and Jira/Confluence
- Experience managing microservices release operations at scale
What We Can Offer You:
Health & Wellbeing
We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Personal & Professional Development
We also invest in your…