Machine Learning Operations Engineer II Job New York New York USA,Software Development

Location: New York

Kensho is S&P Global’s hub for AI innovation and transformation. With expertise in machine learning, natural language processing, and data discovery, we develop and deploy novel solutions to innovate and drive progress at S&P Global and its customers worldwide. Kensho's solutions and research focus on business and financial generative AI applications, agents, data retrieval APIs, data extraction, and much more.

At Kensho, we hire talented people and give them the autonomy and support needed to build amazing technology and products. We collaborate using our teammates' diverse perspectives to solve hard problems. Our communication with one another is open, honest, and efficient. We dedicate time and resources to explore new ideas, but always rooted in engineering best practices. As a result, we can innovate rapidly to produce technology that is scalable, robust, and useful.

The MLOps team is the de facto ML platform team. The team’s mission is critical: empower our ML engineers with state-of-the-art processes, tooling, and infrastructure to iterate quickly, build reliably, and identify potential production issues early. We sit at the intersection of infrastructure and ML, and work closely with all our ML teams (ML Product teams, R&D, …) and our infrastructure teams (Core Infra, SRE, Security).

We are a small and high-leverage team: our work touches practically every AI project. We balance pragmatic platform development with hands‑on exploration at the frontier: building agentic applications ourselves, contributing to open‑source tools, and defining what a mature agentic platform looks like before the industry has settled on the answers. You’re as likely to find us at a top ML conference (NeurIPS, ICLR, ICML) as at a major software or infra conference (AWS re:Invent, PyCon).

To illustrate the point: within a single month, the same engineer went from reimplementing a prompt optimization research paper to shipping Prometheus alerts.

Kensho states that the anticipated base salary range for the position is 130–175k. In addition, this role is eligible for an annual incentive bonus and equity plans. At Kensho, it is not typical for an individual to be hired at or near the top of the range for their role, and compensation decisions are dependent on the facts and circumstances of each case.

What You’ll Do:

Iterate on Kensho’s ML processes to develop tools, services, and frameworks that make every stage of the ML workflow robust, auditable, and usable.
Work closely with ML engineers to understand their unique processes, identify pain points, and form effective solutions.
Empower engineers with the stable tooling necessary to rapidly experiment and actualize their research into demonstrable prototypes and mature products.
Provide resources and training for ML teams on best practices, enabling them to efficiently productionize their work to be leveraged by high‑value products and services.
Evaluate, select, and champion open source and third‑party solutions, driving their adoption across teams and integrating them into Kensho’s existing platform ecosystem.
Ship scalable, efficient, and automated processes for model fine‑tuning and reinforcement learning and for the evaluation of LLMs/Agents.
Improve LLM and agentic observability to help monitor agentic applications in production, detecting performance, decay, and drift issues.
Stay at the frontier by actively tracking emerging tools and frameworks, promoting best practices, and strengthening the technical expertise of the team with your unique skill set.

What You’ll Need:

2+ years of experience in ML infra, MLOps, ML Engineering, or a similar skill set.
Experience managing distributed systems with Kubernetes. It is important to understand Kubernetes concepts and trade‑offs.
Cloud Platform (AWS) understanding. We utilize tools like EKS and managed ML services like Bedrock and SageMaker.
Python proficiency (we are mostly a Python shop).
Familiarity with distributed computing frameworks and workflow orchestration (e.g., Ray, Airflow).
Familiarity with software engineering best practices in an ML context.
Some basic understanding of ML concepts, LLMs and agents.
Ability to debug…