Job Description & How to Apply Below
SQL, NoSQL, Nagios, CloudWatch, Zabbix, Datadog, New Relic, Prometheus, Grafana,
AppDynamics, Site24x7, Telemetry, Splunk, CI/CD, DevOps, Kentico,
SRE, Site Reliability, AIOps, Agentic, Gen AI, AI, ML
Experience Range:
10 - 16 years
Key Responsibilities:
• Design, develop, and maintain observability, monitoring, and alerting systems for AI
platforms and mission-critical backend services.
• Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools
such as Splunk, Prometheus, Grafana, and OpenTelemetry.
• Define and maintain SLOs, SLIs, and real-time health indicators across platform
services and APIs.
• Participate in on-call rotations and lead the resolution of high-impact incidents,
including root cause analysis and postmortem reporting.
• Collaborate with platform engineering teams to enforce governance, compliance, and
security standards in production environments.
• Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g.,
GitLab).
• Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ,
databases, and distributed APIs.
• Support capacity planning, cost analysis, and system tuning to improve platform
performance.
• Advocate for automation-first operations, reducing manual toil through scripting and
reliability tooling.
• Create and maintain documentation, runbooks, and knowledge-sharing resources
across SRE and engineering teams.
• Mentor junior engineers and foster a culture of technical rigor and continuous
improvement.
Qualifications:
• Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s
preferred).
• 10+ years of experience in SRE, DevOps, or operations engineering in cloud-based
environments.
• Hands-on experience with monitoring, alerting, and incident response in distributed
systems.
• Strong coding and scripting skills in Python, Java, or shell scripting languages such as
Bash or PowerShell.
• Solid understanding of database principles and experience with distributed storage
solutions such as Oracle, Cassandra, Solr, and Kafka.
• Proficiency in CI/CD pipelines and GitLab workflows.
• Strong working knowledge of SQL and NoSQL databases, including Oracle and
Cassandra.
• Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and
troubleshooting large-scale environments.
• Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.
• Expertise in observability tools such as Splunk, Grafana, and Prometheus.
• Experience with Kubernetes, container orchestration, and hybrid/multi-cloud
deployments (Azure preferred; AWS/GCP/OCI acceptable).
• Deep understanding of security concepts and protocols, including authentication,
authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.
• Excellent knowledge of ITIL/ServiceNow terminology for incident and problem
management.
• Proven ability to work in fast-paced, incident-driven environments with high uptime
requirements.
Preferred Qualifications:
• Experience supporting AI workloads, model inference systems, or LLM-enabled
platforms.
• Exposure to AIOps or related ML platform observability and reliability practices.
• Familiarity with LangChain, OpenAI, Spring AI, and MCP Server is a strong plus.
• Experience in highly regulated telecom environments with compliance and audit
controls.
• Understanding of AI Gateway patterns and secure API orchestration.
• Background in building secure, zero-downtime platforms with enterprise-scale SLAs.
Knowledge, Skills, and Abilities:
• Strong grasp of SRE best practices, including SLOs, SLIs, postmortems, and chaos
engineering.
• Ability to diagnose system bottlenecks across infrastructure, application, and network
layers.
• Expertise in driving automation across observability, configuration, and deployment
domains.
• Excellent communication and collaboration skills in cross-functional technical teams.
• Curiosity-driven mindset with a passion for learning emerging AI technologies and
improving system reliability.
• Strong commitment to automating processes for proactive monitoring, anomaly
detection, and alerting.