
Senior Platform Engineer, Metal Dev

Job in New York City, Richmond County, New York, 10261, USA
Listing for: CoreWeave
Full Time position
Listed on 2025-12-01
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 153000 - 242000 USD Yearly
Job Description & How to Apply Below

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

Learn more at

About the Role

CoreWeave is seeking a highly skilled and motivated Senior Platform Engineer to join our Hardware Engineering Dev team (Metal Dev). Reporting to the Engineering Manager for Hardware Engineering Dev, you will play a crucial part in the development, deployment, and monitoring of services that manage our bare-metal infrastructure. You will collaborate closely with cross-functional teams, external vendors, and other stakeholders to ensure the successful delivery of highly performant and reliable infrastructure solutions.

Key Responsibilities
  • Incident Management & Support:
    • Lead incident response efforts by identifying and resolving service disruptions quickly, while coaching junior team members through resolution.
    • Lead the documentation of incidents, conduct in-depth root cause analysis (RCA), and drive post-incident reviews (PIRs) to identify systemic issues. Implement long-term improvements that prevent service degradation.
    • Own the development and continuous improvement of incident response playbooks, ensuring preparedness for a wide range of failure scenarios.
    • Clearly communicate progress during incidents to management, stakeholders, and cross-functional teams. Keep clear records of incident activities.
    • Develop a clear understanding of how the various services work in production, and build thorough knowledge of their internals and how they interact with the entire stack.
  • Operational Support & Reliability:
    • Build a strategy around making our core services perform at their best. This includes improvements to the services for robustness as well as supportability in production.
    • Own system observability and health, leveraging tools like Prometheus and Grafana to proactively detect performance bottlenecks and prevent incidents.
    • Lead automation efforts to streamline incident detection and recovery, minimizing manual intervention.
    • Define and drive KPIs and SLAs for incident management, and ensure alignment with organizational reliability objectives.
    • Collaborate with engineers across teams to improve platform reliability, resilience, and disaster recovery.
    • Collaborate with upstream communities, including Go and Redfish-based services.
    • Design and implement solutions to build operational efficiency and stability.
    • Document hardware automation workflows and processes.
    • Create CI/CD pipelines.
    • Ensure smooth operation of all aspects of the server hardware lifecycle, from provisioning to end-of-life, by troubleshooting bugs, automating common tasks, and documenting processes.
    • Partner with the Fleet Operations Team to design scalable tooling and processes that enable self-service and reduce escalation overhead.
    • Build out dashboards and alerts to make operational troubleshooting efficient.
    • Participate in on-call rotation.
Minimum Qualifications
  • 7+ years of experience in cloud operations, site reliability engineering (SRE), or related technical roles.
  • Understanding of cloud platforms (e.g., Kubernetes, AWS, GCP) and basic knowledge of cloud infrastructure.
  • Familiarity with incident management practices and frameworks (e.g., ITIL, SRE best practices).
  • Proficiency with Go.
  • Prior experience with Prometheus / Grafana.
  • Previous experience deploying containerized applications using Kubernetes.
  • Excellent documentation skills and attention to detail.
  • Strong analytical and problem‑solving abilities.
  • Experience serving on an on-call rotation supporting production services.
Compensation

The base pay for this position ranges from $153,000 to $242,000. Pay is based on a number of factors including market location and may vary depending on job‑related knowledge,…

Position Requirements
10+ Years work experience