Cloud Engineer - Observability & SRE
Remote / Online - Candidates ideally in
Plano, Collin County, Texas, 75086, USA
Listed on 2026-06-02
Listing for:
GDH
Remote/Work from Home position
Job specializations:
- IT/Tech: Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
Role Summary
This role is for a senior Cloud Engineer with expertise in building and managing scalable observability and infrastructure platforms for enterprise-level cloud microservices environments. This hybrid role demands hands‑on experience with container orchestration, cloud infrastructure automation, and high‑volume monitoring systems. The engineer will own end-to-end components, support production operations, and leverage AI tools for system troubleshooting and code generation.
Responsibilities
- Design, develop, and operate observability platforms enabling logging, metrics collection, and tracing for cloud-based microservices applications.
- Manage and optimize large‑scale Kubernetes clusters across multiple regions, including Helm chart management, pod scheduling, and resource tuning.
- Own and maintain CI/CD pipelines using tools such as Argo CD, Helm, and GitOps methodologies to ensure reliable deployment workflows.
- Implement Infrastructure as Code (IaC) solutions utilizing Terraform on AWS to provision and manage cloud infrastructure at scale.
- Operate and maintain monitoring ecosystems including OpenSearch/Elasticsearch, Prometheus, Grafana, Splunk, and Kafka, ensuring high availability and performance.
- Develop automation solutions to proactively detect, respond to, and remediate production issues.
- Ensure security and compliance by managing vulnerability patching and automating security best practices in container environments.
- Collaborate with cross-functional teams to improve system reliability, scalability, and performance, contributing to distributed system design.
- Participate in on‑call rotations, incident response, and post‑incident analysis to uphold SLA commitments.
- Utilize AI‑assisted coding and troubleshooting tools to accelerate system development, automation, and incident resolution.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or related field.
- Minimum of 8 years of experience in DevOps, SRE, or platform engineering roles supporting production cloud environments.
- Proven incident response experience, including alert triage, root cause analysis, and SLA management in 24/7 operations.
- Expertise in Infrastructure as Code principles with proficiency in Terraform, Ansible, or similar automation tools for cloud provisioning.
- Strong scripting skills in Python, Golang, or Bash for automation, tooling, and CI/CD pipeline integration.
- Extensive experience operating and troubleshooting large‑scale Kubernetes workloads, including Helm chart management and multi‑cluster orchestration.
- Hands‑on knowledge of observability stacks such as OpenSearch, Prometheus, Grafana, Loki, and Splunk, including query optimization and capacity planning.
- Familiarity with Kafka and AWS MSK, including cluster operation, topic configuration, and schema management.
- Experience deploying, managing, and migrating Splunk Enterprise environments with Kubernetes‑based log shipping architectures.
- Working knowledge of OpenTelemetry, distributed tracing, and application performance monitoring in cloud environments.
- Understanding of security frameworks, container hardening practices, and vulnerability remediation at scale, including standards such as FedRAMP, STIG, IL5, ISO 27001, and SOC 2.
- Experience using AI tools such as LLMs, GitHub Copilot, or custom AI agents to enhance operational workflows and incident management.
- Effective communication skills and the ability to work independently in a hybrid work setting.
Publishing Pay Range: $65.00 - $67.00 hourly
This position offers a hybrid schedule, with time split between the office and remote work.