
Manager, IT/Tech

Job in Campus, Livingston County, Illinois, 60920, USA
Listing for: Infinite Computer Solutions
Full Time, Seasonal/Temporary position
Listed on 2025-12-02
Job specializations:
  • IT/Tech
    Cloud Computing, SRE/Site Reliability
Location: Campus

Job Description

Location: Bangalore

Job Type: Full-time

Job Summary

We're seeking a motivated and passionate Site Reliability Engineering (SRE) leader with strong expertise in programming, distributed systems, and Kubernetes. In this role, you'll help evolve our SRE team's Kubernetes and microservices architecture, while also supporting the integration of Agentic AI workloads both within Kubernetes and via managed services. The SRE function plays a critical role in maintaining system visibility, ensuring platform scalability, and enhancing operational efficiency.

As part of this, you'll help drive AIOps initiatives, leveraging AI tools and automation to proactively detect, diagnose, and remediate issues, enhancing the reliability and performance of Zyter’s global platform. As a cloud practitioner, you’ll have the opportunity to apply your technical strengths, shape platform reliability strategies, and collaborate closely with engineering teams across the organization. You’ll work as part of a globally distributed, inclusive team focused on AWS-based cloud infrastructure.

Key Responsibilities
Core SRE
  • Collaborate with development teams, product owners, and stakeholders to define, enforce, and track SLOs and manage error budgets.
  • Improve system reliability by designing for failure, testing edge cases, and monitoring key metrics.
  • Boost performance by identifying bottlenecks, optimizing resource usage, and reducing latency across services.
  • Build scalable systems that handle growth in traffic or data without compromising performance.
  • Stay directly involved in technical work, contributing to the codebase and leading by example in solving complex infrastructure challenges.
AIOps
  • Design and implement scalable deployment strategies optimized for large language models like Llama, Claude, Cohere and others.
  • Set up continuous monitoring for model performance, ensuring robust alerting systems are in place to catch anomalies or degradation.
  • Stay current with advancements in MLOps and Generative AI, proactively introducing innovative practices to strengthen AI infrastructure and delivery.
Monitoring And Alerting
  • Set up monitoring and observability using Prometheus, Grafana, CloudWatch, and logging with OpenSearch/ELK.
  • Proactively identify and resolve issues by leveraging monitoring systems to catch early signals before they impact operations.
  • Design and maintain alerting mechanisms that are clear, actionable, and tuned to avoid unnecessary noise or alert fatigue.
  • Continuously improve system observability to enhance visibility, reduce false positives, and support faster incident response.
  • Apply best practices for alert thresholds and monitoring configurations to ensure reliability and maintain system health.
Cost Management
  • Monitor infrastructure usage to identify waste and reduce unnecessary spending.
  • Optimize resource allocation by using right-sized instances, auto-scaling, and spot instances where appropriate.
  • Implement cost-aware design practices during architecture and deployment planning.
  • Track and analyze monthly cloud costs to ensure alignment with budget and forecast.
  • Collaborate with teams to increase cost visibility and promote ownership of cloud spend.
Required Skills & Experience
  • Strong experience as an SRE with a proven track record of managing large-scale, highly available systems.
  • Knowledge of core operating system principles, networking fundamentals, and systems management.
  • Strong understanding of cloud deployment and management practices.
  • Hands‑on experience with Terraform/OpenTofu, Helm, Docker, Kubernetes, Prometheus, and Istio.
  • Hands‑on experience with tools and techniques to diagnose and uncover container performance issues.
  • Skilled with AWS services from both technology and cost perspectives.
  • Skilled in DevOps/SRE practices and build/release pipelines.
  • Experience working with mature development practices and tools for source control, security, and deployment.
  • Hands‑on experience with Python/Golang/Groovy/Java.
  • Excellent communication skills, written and verbal.
  • Strong analytical and problem‑solving skills.
Preferred Qualifications
  • Experience scaling Kubernetes clusters and managing ingress traffic.
  • Familiarity with multi‑environment deployments and automated workflows.
  • Knowledge of AWS service quotas, cost optimization, and networking nuances.
  • Strong troubleshooting skills and effective communication across teams.
  • Prior experience in regulated environments (HIPAA, SOC2, ISO 27001) is a plus.
Qualifications

Graduate

Years of Experience – Minimum: 10 years

Years of Experience – Maximum: 15 years

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Other

Industries

IT Services and IT Consulting
