AIOps Engineer
Job in
Stanford, Santa Clara County, California, 94305, USA
Listed on 2025-12-31
Listing for:
Select Source International
Full Time position
Job specializations:
- IT/Tech
Cloud Computing, Systems Engineer
Job Description
Location: 408 Panama Mall, Stanford, CA 94305 (Hybrid – 2 days on campus)
Duration: 12 months
Shift: 1st Shift (9am – 6pm)
Position Overview
The AI‑Ops Engineer is a key technical contributor responsible for evolving traditional DevOps into AI‑Ops. This role leverages AI and machine learning to automate and enhance IT operations, including performance monitoring, anomaly detection, root‑cause analysis, and automated remediation.
Key Responsibilities
AI‑Driven Operations & Automation
- Implement AI‑Ops solutions that use ML algorithms to automate performance monitoring, workload scheduling, and infrastructure management.
- Build anomaly detection systems that identify infrastructure issues before they impact users.
- Develop automated root‑cause analysis capabilities using ML to correlate events and filter noise from critical alerts.
- Create predictive maintenance workflows that analyze historical patterns to proactively mitigate issues.
- Design and implement automated remediation scripts that respond to incidents without human intervention.
Observability & Intelligent Monitoring
- Architect comprehensive observability platforms that aggregate data from disparate sources into unified dashboards.
- Implement intelligent alerting systems using NLP and ML to reduce alert fatigue and surface actionable insights.
- Build real‑time analytics dashboards for coordinated diagnosis across teams.
- Deploy application performance monitoring (APM) solutions integrated with AI‑driven analytics, ensuring end‑to‑end visibility across cloud infrastructure, applications, and AI/ML workloads.
Cloud Infrastructure & DevOps
- Design, build, and maintain scalable, secure AWS infrastructure using Infrastructure as Code (CloudFormation, Terraform, or CDK).
- Implement and manage containerised environments using Docker, AWS ECS, Fargate, and Kubernetes (EKS).
- Build CI/CD pipelines for continuous delivery, integrating AI‑powered code quality and deployment optimisation.
- Manage cloud automation and optimisation to improve cost‑efficiency and resource utilisation.
- Ensure compliance with Stanford and regulatory standards (FERPA, GDPR) for secure data handling and governance.
Collaboration & Continuous Improvement
- Partner with cross‑functional teams to implement domain‑agnostic AI‑Ops solutions across the organisation.
- Use Git‑based version control and code review best practices as part of a collaborative, agile workflow.
- Document operational procedures, runbooks, and AI‑Ops workflows for team knowledge sharing.
- Continuously evaluate and adopt emerging AI‑Ops tools, AWS services, and AI‑driven automation technologies.
- Contribute to building an AI‑first operational culture that prioritises automation and predictive capabilities.
Required Qualifications
- Education & Certifications: Bachelor’s degree in Computer Science, DevOps, Cloud Engineering, or a related field (Master’s preferred). AWS certification preferred (Solutions Architect, SysOps Administrator, or DevOps Engineer); professional‑level certification is a plus.
- Experience: 3+ years in DevOps, SRE, or Cloud Engineering. 2+ years of hands‑on AWS experience (EC2, ECS, Lambda, S3, IAM, VPC) and scaling monitoring/observability solutions.
- Familiarity: ML/AI concepts and their application to operational automation.
- Languages: Python (required); Bash, Go, or TypeScript (preferred).
- AIOps & Monitoring: CloudWatch, X‑Ray, Prometheus, Grafana, Datadog, or Splunk with ML capabilities.
- Infrastructure as Code: AWS CloudFormation, Terraform, or AWS CDK.
- Containers & Orchestration: Docker, AWS ECS/Fargate, Kubernetes (EKS).
- AWS Services: Lambda, EC2, S3, API Gateway, EventBridge, CloudWatch, IAM, VPC, CodePipeline, SageMaker.
- CI/CD Tools: GitHub Actions, AWS CodePipeline, Jenkins, or GitLab CI.
- Data & Analytics: Log aggregation, metrics analysis, and event correlation platforms.
- Strong understanding of AI‑Ops principles – using AI to enhance, not just support, IT operations.
- Excellent problem‑solving, debugging, and root‑cause analysis skills.
- Rapid learning, adaptability, and continuous improvement mindset.
- Strong communication and collaboration skills with…