Site Reliability and Operations Engineer
Listed on 2025-11-27
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
04/29/2025
Contract
Active
Job Description:
Job Summary:
We are seeking a highly skilled Site Reliability and Operations Engineer (SRE) with a robust background in Kubernetes-based distributed caching and compute grid systems. The ideal candidate will possess a solid blend of infrastructure engineering and software development skills. This role will focus on the design, implementation, and maintenance of high-performance distributed platforms to ensure high availability, scalability, and system observability.
Job Responsibilities:
Development & Implementation:
Design, build, and enhance distributed caching and compute grid solutions on Kubernetes/OpenShift platforms.
Leverage technologies such as IBM Spectrum Symphony, TIBCO GridServer, or similar for high-throughput compute grids.
Utilize containerization tools (Docker, Helm) to orchestrate microservices and container workloads.
Apply parallel compute strategies and optimize load balancing for application performance.
Site Reliability Engineering (SRE):
Ensure platform reliability, scalability, and minimal downtime by maintaining robust distributed systems.
Implement and maintain observability and monitoring using Prometheus, Grafana, ELK, or OpenTelemetry.
Automate infrastructure provisioning and deployments using Ansible, Helm Charts, and similar tools.
Troubleshoot complex system and infrastructure issues in Kubernetes environments.
Support CI/CD processes using tools like Jenkins, ArgoCD, and GitHub Actions.
Required Skills & Qualifications:
- Strong experience with Kubernetes, including OpenShift, across both on-prem and cloud environments.
- Proficiency in at least one programming language: Java, Go, or Python.
- In-depth knowledge of containerization technologies such as Docker and Helm.
- Hands-on experience with CI/CD tools and pipeline integration.
- Expertise in observability and monitoring using Prometheus, Grafana, Loki, and Jaeger.
- Knowledge of service meshes like Istio or Linkerd.
- Experience in multi-cluster and hybrid cloud Kubernetes deployments.
- Solid understanding of networking, security practices, and performance optimization in distributed systems.
- Experience with high-performance computing platforms or grid computing frameworks.
- Familiarity with distributed caching strategies and data sharding.
- Strong communication and documentation skills.
- Relevant certifications (e.g., CKAD, CKA, Red Hat Certified Specialist in OpenShift).