
SRE Lead - Full-time - Austin, TX

Job in Austin, Travis County, Texas, 78753, USA
Listing for: Exaways Corporation
Full Time position
Listed on 2026-06-02
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Job Description:

We are seeking a highly skilled, hands-on SRE Lead Engineer with solid experience to help lead transformational initiatives within IT operations, including development. In this role, you will help design and implement SRE solutions and drive the transformation of IT operations organizations toward an engineering-centric approach.

Responsibilities:

• Participate in the design and architecture of reliable, scalable, high-performance systems and services, with a focus on operational excellence, availability, and performance.

• Primary skill set: expertise in observability as a service and telemetry data collection using Dynatrace APM, SolarWinds, open-source tools (Prometheus and Grafana), log aggregation (Kibana or Splunk), and AIOps tools.

• Deep understanding of login authentication mechanisms using Ping, ForgeRock, and SiteMinder technologies (session management and cookie management).

• Build correlation mechanisms and dashboards that provide end-to-end visibility of requests from external to internal applications.

• Evangelize SRE evolution within IT operations and promote a culture of engineering excellence and best practices.

• Define best practices and principles for SRE, including incident management, monitoring, alerting, and automation.

• Collaborate with development teams on resiliency to ensure that services and applications are designed with operational reliability in mind.

• Implement monitoring systems to assess the performance of applications and infrastructure, and proactively identify areas for optimization.

• Understand incident and problem management processes and post-mortems, and drive improvements to prevent future incidents.

• Analyze resource utilization patterns and forecast future capacity needs to ensure optimal performance and cost-efficiency.

• Ensure that SRE practices align with security and compliance requirements and implement measures to protect systems and data.

• Drive operational excellence with a focus on automation, developing tools to streamline operational tasks and increase efficiency.

• Provide guidance and mentorship to SRE teams, fostering skill development and building a strong, capable SRE practice.

• Develop close relationships with other operational teams to integrate SRE practices and drive overall operational improvements across enterprises.

• Stay up to date on industry trends, new technologies, and best practices in SRE and apply relevant advancements to the organization.

Qualifications:

• Around 10-12 years of hands-on SRE experience with cloud technologies, development, SRE toolsets, and automation.

• Primary skill set: expertise in observability as a service and telemetry data collection using Dynatrace APM, SolarWinds, open-source tools (Prometheus and Grafana), log aggregation (Kibana or Splunk), and AIOps tools.

• Deep understanding of login authentication mechanisms using Ping, ForgeRock, and SiteMinder technologies (session management and cookie management).

• Experience with correlation mechanisms and dashboards that provide end-to-end visibility of requests from external to internal applications.

• Strong hands-on experience with cloud technology (AWS): Control Tower, project setup, account creation, RDS, SSO.

• Solid understanding of and hands-on experience with Docker/Kubernetes.

• Good experience with Linux commands, GitLab CI/CD setup, and Terraform (state management, etc.).

• Monitoring and alerting setup experience with Splunk, Prometheus, Grafana, Kibana, ELK, etc.

• Hands-on experience with APM tools, preferably Datadog, AppDynamics, or Dynatrace.

• Good understanding of an observability framework that leverages programmatic SLI/SLO blueprints to standardize the collection of golden signals.

• Automation experience (data refresh, releases, DB snapshots) using Ansible or any scripting language.

• Experience with the following: Groovy DSL, Java, Python, YAML, and microservices architecture.

• Good understanding of and hands-on experience with MQ and Kafka.

• Experience with databases (Oracle, MySQL).

Good to have:

• Any of the relevant professional certifications: Certified Site Reliability Engineer (CSRE), Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer Professional, Google Cloud Professional DevOps Engineer.