Senior Platform Engineer, Observability and AIOps
Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listed on 2026-06-21
Listing for: Synopsys Inc
Full Time position
Job specializations:
- IT/Tech
  SRE/Site Reliability, Systems Engineer, Cloud Computing: Infrastructure & Operations
Job Description & How to Apply Below
You Are
You are a strong platform engineer with a passion for building platforms and services that improve how complex infrastructure is observed, understood, and operated. You bring experience developing solutions for observability, automation, and operational intelligence in large-scale enterprise environments. You are comfortable working across software, systems, and operations domains, and you enjoy solving difficult technical problems at scale.
What You'll Be Doing
- Design, develop, and enhance software solutions that support observability, operational analytics, and intelligent automation across infrastructure and platform services.
- Build scalable, reliable, high-performance systems and services for telemetry collection, processing, searching, correlation, analysis, and visualization.
- Develop tools, APIs, and integrations that enhance monitoring, alerting, incident management, and operational workflow automation.
- Create software capabilities that improve visibility across operating systems, orchestration platforms, compute infrastructure, storage, networking, cloud services, and business-critical enterprise platforms.
- Partner with infrastructure, SRE, platform engineering, and operations teams to identify observability gaps and implement scalable solutions.
- Apply Infrastructure as Code practices to deploy, configure, and maintain observability components in a consistent and repeatable way.
- Apply data-driven techniques, AI-assisted methods, or intelligent analytics to improve signal quality, anomaly detection, alert prioritization, and root cause analysis.
- Document technical designs, implementation patterns, and operating procedures to improve team productivity and efficiency across the organization.
- Enable faster incident response and resolution across a global hybrid-cloud infrastructure environment that supports mission-critical engineering and business workflows.
- Reduce operational complexity and alert fatigue by building intelligent systems that surface actionable signals instead of noise.
- Improve infrastructure reliability and uptime by making it easier for teams to see, understand, and act on what is happening in real time.
- Accelerate troubleshooting and root cause analysis by correlating telemetry across compute, storage, networking, and cloud platforms.
- Increase operational efficiency by automating repetitive triage, escalation, and remediation workflows.
- Empower SRE and platform teams with better tooling, better visibility, and better data so they can focus on high-value work instead of firefighting.
- Contribute to a culture of operational excellence where observability and intelligent automation are first-class engineering priorities.
- 8-10 years of experience in software engineering, platform engineering, site reliability engineering, or infrastructure engineering, including substantial experience building observability capabilities.
- Proven experience working in large-scale infrastructure environments with thousands of high-performance compute nodes and/or petabyte-scale storage.
- Strong hands-on experience designing, implementing, and operating observability platforms using technologies such as Elastic, Grafana, Kafka, Logstash, OpenTelemetry, and Prometheus.
- Strong scripting and programming skills in Python, Ruby, or Bash for custom tool and ETL process development.
- Solid working knowledge of Linux systems, Kubernetes, and containerized application environments.
- Experience with Infrastructure as Code and configuration management tools such as Ansible, and experience with incident management platforms like ServiceNow, Rootly, or PagerDuty is a plus.
- Practical knowledge of AI technologies, including machine learning, generative AI, LLM-based tools, or intelligent analytics, with experience applying them to observability, incident response, automation workflows, or operational decision-making.
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related engineering field.
- You can walk into a room of SREs drowning in alerts and leave with a tooling plan that changes how they understand the problem, not just how they…
Position Requirements
10+ years of work experience