SRE DevOps Engineer

Job in Frisco, Collin County, Texas, 75034, USA
Listing for: Highbrow LLC
Full Time position
Listed on 2026-05-30
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 110,000 - 130,000 USD yearly
Job Description & How to Apply Below

SRE DevOps Engineer

Location: Overland Park, KS / Atlanta, GA / Frisco, TX (Onsite)

Requirements
  • 4–9 years in SRE/DevOps/Systems Engineering as a Senior or Principal Engineer.
  • Strong hands‑on experience with Kubernetes, container orchestration, and API management.
  • Working knowledge of WAFs, network security, and database technologies (SQL/NoSQL).
  • Proficient in automation and scripting (Python, Go, Ansible, Terraform, etc.)
  • Strong observability/monitoring experience.
  • Experience with CI/CD pipelines, GitOps, and infrastructure as code.
  • Solid problem‑solving and collaboration skills.
Job Responsibilities
  • Resolve escalated incidents across Kubernetes, API Proxy, WAF, DBs, and infra platforms.
  • Design and improve runbooks, automating manual steps wherever possible.
  • Lead and contribute to building self‑healing systems and self‑service tooling for users.
  • Analyze incident trends and propose improvements in monitoring, capacity, and reliability.
  • Collaborate with engineering teams on deployment, upgrades, and performance optimization.
  • Conduct postmortems, document RCA, and ensure learning is captured.
  • Mentor and coach L1 engineers.
Skills
Mandatory Skills (Must-Have)
  • Advanced Incident Troubleshooting & Resolution

    Expectation:
    Diagnose and resolve escalated incidents that L1 cannot handle, often across multiple layers (infrastructure, application, network).

    Example:
    For an API outage, identify whether the root cause is in Kubernetes pod networking, an API gateway misconfiguration, or backend DB latency, and apply the fix.

  • Kubernetes & Container Orchestration Expertise

    Expectation:
    Comfortable with deployments, scaling, networking, and debugging cluster‑level issues.

    Example:
    Troubleshoot why pods are pending by checking node capacity, taints/tolerations, and cluster autoscaler logs.

  • Automation & Scripting (Python, Go, Bash, Ansible, Terraform)

    Expectation:
    Write scripts and automation to reduce manual toil, enhance monitoring, and improve incident resolution speed.

    Example:
    Develop a Python script to automatically collect pod and system logs when a service crashes.

  • Observability & Monitoring Tooling

    Expectation:
    Deep understanding of monitoring, alerting, tracing, and logging systems.

    Example:
    Build Prometheus alert rules to detect DB query spikes; configure Grafana dashboards for API latency.

  • CI/CD & Infrastructure as Code (IaC)

    Expectation:
    Familiarity with GitOps workflows, CI/CD pipelines, and infrastructure provisioning.

    Example:
    Enhance a Jenkins pipeline to add automated smoke tests before promoting Kubernetes deployments.

  • Database Troubleshooting (SQL & NoSQL)

    Expectation:
    Identify performance bottlenecks, connection issues, and basic tuning opportunities.

    Example:
    Run queries to detect slow‑running SQL statements causing latency in an application.

  • Incident Management & RCA

    Expectation:
    Act as incident commander for escalated issues, lead bridge calls, and produce Root Cause Analyses.

    Example:
    After a WAF misconfiguration causes downtime, lead the investigation, document the timeline, and propose preventive actions.

  • Mentorship & Runbook Improvement

    Expectation:
    Coach L1 engineers, refine runbooks, and introduce new automated workflows.

    Example:
    Update a runbook to add automated Kubernetes log collection instead of manual steps.

Preferred Skills (Nice-to-Have)
  • Cloud Platform Engineering (AWS, Azure, GCP)

    Expectation:
    Hands‑on skills in provisioning, scaling, and securing cloud workloads.

    Example:
    Diagnose why an AWS ALB is misrouting traffic after a deployment.

  • Security & WAF Management

    Expectation:
    Understand WAF rules, common attacks (SQL injection, XSS), and how to apply fixes.

    Example:
    Investigate false positives in WAF logs and adjust rule sets with security teams.

  • Capacity & Performance Engineering

    Expectation:
    Anticipate scaling needs, tune resource utilization, and propose optimizations.

    Example:
    Identify that a Kubernetes deployment is CPU‑throttled and adjust HPA (Horizontal Pod Autoscaler) configs.

  • Automation Platform Integration (AIOps, ChatOps)

    Expectation:
    Integrate AI/ML‑powered tools for anomaly detection and auto‑remediation.

    Example:
    Implement a ChatOps bot that runs predefined Kubernetes troubleshooting commands in Slack.

  • Cross‑Platform Expertise (Hybrid Infra)

    Expectation:
    Experience supporting both on‑prem and cloud environments seamlessly.

    Example:
    Compare latency patterns between on‑prem DBs and cloud‑hosted APIs to identify bottlenecks.
