Senior Platform Engineer, Observability and AIOps

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Synopsys Inc
Full Time position
Listed on 2026-06-21
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 125,000 - 150,000 USD per year
Job Description & How to Apply Below

You Are

You are a strong platform engineer with a passion for building platforms and services that improve how complex infrastructure is observed, understood, and operated. You bring experience developing solutions for observability, automation, and operational intelligence in large-scale enterprise environments. You are comfortable working across software, systems, and operations domains, and you enjoy solving difficult technical problems at scale.

What You'll Be Doing
  • Design, develop, and enhance software solutions that support observability, operational analytics, and intelligent automation across infrastructure and platform services.
  • Build scalable, reliable, high-performance systems and services for telemetry collection, processing, searching, correlation, analysis, and visualization.
  • Develop tools, APIs, and integrations that enhance monitoring, alerting, incident management, and operational workflow automation.
  • Create software capabilities that improve visibility across operating systems, orchestration platforms, compute infrastructure, storage, networking, cloud services, and business-critical enterprise platforms.
  • Partner with infrastructure, SRE, platform engineering, and operations teams to identify observability gaps and implement scalable solutions.
  • Apply Infrastructure as Code practices to deploy, configure, and maintain observability components in a consistent and repeatable way.
  • Apply data-driven techniques, AI-assisted methods, or intelligent analytics to improve signal quality, anomaly detection, alert prioritization, and root cause analysis.
  • Document technical designs, implementation patterns, and operating procedures to improve team productivity and efficiency across the organization.

The Impact You Will Have
  • Enable faster incident response and resolution across a global hybrid-cloud infrastructure environment that supports mission-critical engineering and business workflows.
  • Reduce operational complexity and alert fatigue by building intelligent systems that surface actionable signals instead of noise.
  • Improve infrastructure reliability and uptime by making it easier for teams to see, understand, and act on what is happening in real time.
  • Accelerate troubleshooting and root cause analysis by correlating telemetry across compute, storage, networking, and cloud platforms.
  • Increase operational efficiency by automating repetitive triage, escalation, and remediation workflows.
  • Empower SRE and platform teams with better tooling, better visibility, and better data so they can focus on high-value work instead of firefighting.
  • Contribute to a culture of operational excellence where observability and intelligent automation are first-class engineering priorities.

What You'll Need
  • 8-10 years of experience in software engineering, platform engineering, site reliability engineering, or infrastructure engineering, including substantial experience building observability capabilities.
  • Proven experience working in large-scale infrastructure environments with thousands of high-performance compute nodes and/or petabyte-scale storage.
  • Strong hands-on experience designing, implementing, and operating observability platforms using technologies such as Elastic, Grafana, Kafka, Logstash, OpenTelemetry, and Prometheus.
  • Strong scripting and programming skills in Python, Ruby, or Bash for custom tool and ETL process development.
  • Solid working knowledge of Linux systems, Kubernetes, and containerized application environments.
  • Experience with Infrastructure as Code and configuration management tools such as Ansible. Experience with incident management platforms such as ServiceNow, Rootly, or PagerDuty is a plus.
  • Practical knowledge of AI technologies, including machine learning, generative AI, LLM-based tools, or intelligent analytics, with experience applying them to observability, incident response, automation workflows, or operational decision-making.
  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related engineering field.

Who You Are
  • You can walk into a room of SREs drowning in alerts and leave with a tooling plan that changes how they understand the problem, not just how they…

Position Requirements
10+ years of work experience