SRE Lead - Full-time - Austin, TX
Job in
Austin, Travis County, Texas, 78753, USA
Listed on 2026-06-02
Listing for:
Exaways Corporation
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Job Description & How to Apply Below
We are seeking a highly skilled, hands-on SRE Lead Engineer with solid experience to help lead transformational initiatives within IT operations, including development. In this role, you will help design and implement cutting-edge SRE solutions, driving IT operations organizations to adopt an engineering-centric approach.
Responsibilities:
• Participate in the design and architecture of reliable, scalable, and high-performance systems and services, with a focus on operational excellence, availability, and performance.
• Serve as the primary expert in Observability as a Service and telemetry data collection using Dynatrace APM, SolarWinds, open-source tools (Prometheus and Grafana), log aggregation (Kibana or Splunk), and AIOps tools.
• Apply a deep understanding of login authentication mechanisms using Ping, ForgeRock, and SiteMinder technologies (session management and cookie management).
• Build correlation mechanisms and dashboards that provide end-to-end visibility of requests from external to internal applications.
• Evangelize SRE evolution within IT operations and promote a culture of engineering excellence and best practices.
• Define best practices and principles for SRE, including incident management, monitoring, alerting, and automation.
• Collaborate with development teams on resiliency to ensure that services and applications are designed with operational reliability in mind.
• Implement monitoring systems to assess the performance of applications and infrastructure, and proactively identify areas for optimization.
• Understand incident and problem management processes and post-mortems, and drive improvements to prevent future incidents.
• Analyze resource utilization patterns and forecast future capacity needs to ensure optimal performance and cost-efficiency.
• Ensure that SRE practices align with security and compliance requirements and implement measures to protect systems and data.
• Drive operational excellence with a focus on automation, developing tools to streamline operational tasks and increase efficiency.
• Provide guidance and mentorship to SRE teams, fostering skill development and building a strong and capable SRE practice.
• Develop close relationships with other operational teams to integrate SRE practices and drive overall operational improvements across enterprises.
• Stay up to date on industry trends, new technologies, and best practices in SRE and apply relevant advancements to the organization.
Qualifications:
• Around 10-12 years of hands-on SRE experience with cloud technologies, development, SRE toolsets, and automation
• Primary skill set: expertise in Observability as a Service and telemetry data collection using Dynatrace APM, SolarWinds, open-source tools (Prometheus and Grafana), log aggregation (Kibana or Splunk), and AIOps tools
• Deep understanding of login authentication mechanisms using Ping, ForgeRock, and SiteMinder technologies (session management and cookie management)
• Experience with correlation mechanisms and dashboards that provide end-to-end visibility of requests from external to internal applications
• Strong hands-on experience with a cloud technology (AWS): Control Tower, project setup, account creation, RDS, SSO
• Solid understanding of and hands-on experience with Docker/Kubernetes
• Good experience with Linux commands, GitLab CI/CD setup, and Terraform (state management, etc.)
• Monitoring and alerting setup experience with Splunk, Prometheus, Grafana, Kibana, ELK, etc.
• Hands-on experience with APM tools, preferably Datadog, AppDynamics, or Dynatrace
• Good understanding of Observability Framework leveraging programmatic SLI/SLO blueprints to standardize the collection of golden signals.
• Automation experience (data refresh, releases, DB snapshots) using Ansible or a scripting language
• Experience with the following languages and formats (Groovy DSL, Java, Python, YAML) and with microservices architecture
• Good understanding of and hands-on experience with MQ and Kafka
• Experience with Databases (Oracle, MySQL)
Good to have:
Any of the relevant professional certifications: Certified Site Reliability Engineer (CSRE), Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer - Professional, Google Cloud Professional DevOps Engineer