Senior Machine Learning Operations Engineer Job, Agra area, Madhya Pradesh, India, IT/Tech

Location:

India

Employment Type:

Full‑time
About Brightly Software
Brightly Software is a leader in intelligent asset management and operational optimization, empowering organizations with data‑driven insights. As we expand our AI and ML capabilities, we are seeking a Senior MLOps Engineer to build and scale the infrastructure that powers our next generation of predictive and autonomous solutions.

Role Overview
As a Senior MLOps Engineer, you will architect, develop, and operate end‑to‑end machine learning infrastructure on AWS. You will work at the intersection of ML engineering, cloud infrastructure, and developer productivity, enabling Brightly's data science teams to move seamlessly from experimentation to reliable, secure, and cost‑efficient production systems.
Your work will ensure that ML models and data pipelines are scalable, observable, and compliant with best‑in‑class MLOps practices.

Key Responsibilities
ML Platform & Infrastructure (AWS‑focused)
Design, build, and operate ML/AI development platforms on AWS, leveraging services such as Amazon SageMaker (Studio, Training, Real‑Time & Async Inference, Pipelines, Feature Store), S3, Glue, Lambda, ECS/EKS, and related cloud infrastructure.
Implement infrastructure‑as‑code using Terraform or equivalent, and manage workflow orchestration using AWS Step Functions or Airflow.
Data & Model Pipelines
Build automated data ingestion and transformation pipelines using S3, Glue, EMR/Spark, and Redshift, incorporating data quality and lineage tooling (e.g., Great Expectations, Deequ).

CI/CD for Machine Learning
Develop CI/CD pipelines for ML with CodeBuild, CodePipeline, or GitHub Actions, integrating unit tests, data contract checks, model validation, canary/shadow deployments, and automated rollback strategies.
Model Deployment & Operations
Deploy real‑time inference endpoints (SageMaker endpoints or FastAPI‑based services on Lambda/ECS/EKS) and scalable batch processing jobs.
Define SLOs, implement autoscaling, and drive cost/performance optimizations across ML workloads.
Monitoring, Observability & Governance
Implement production monitoring for drift, bias, and performance using SageMaker Model Monitor and service telemetry tools like CloudWatch, Prometheus, and Grafana.
Enforce security and governance best practices, including least‑privilege IAM, VPC‑isolated architectures, encryption, and secret management.
Cross‑Functional Collaboration
Partner closely with data scientists, ML engineers, and backend engineers to productionize ML models and streamline development workflows.
Contribute to the integration of emerging GenAI workloads, including Amazon Bedrock, vector databases (e.g., OpenSearch), and RAG pipelines.

Required Qualifications
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
8+ years of professional experience in ML engineering, DevOps, cloud engineering, or MLOps roles, with at least 3 years in a senior or lead capacity.
3+ years of proven track record in designing and architecting robust, scalable ML systems and infrastructure in cloud environments, particularly on AWS.
5+ years of deep experience building on the AWS ML ecosystem, including SageMaker, S3, Lambda, ECR, EKS/ECS, Step Functions, IAM, VPC networking, and CI/CD tooling.
3+ years of hands-on experience deploying, maintaining, and scaling ML models in production environments.
3+ years of strong Python development skills and familiarity with Docker‑based workflows.
5+ years of solid understanding of ML lifecycles, model evaluation, and monitoring patterns.
5+ years of extensive experience with infrastructure‑as‑code (Terraform, CloudFormation).
5+ years of expertise in designing system architecture for ML platforms, including microservices, container orchestration, and cloud networking.
3+ years of familiarity with MLOps best practices as defined by AWS and industry standards.
2+ years of experience with data quality frameworks (Great Expectations, Deequ).
2+ years of experience optimizing distributed training workflows on AWS.
3+ years of knowledge of security and compliance requirements for ML in enterprise settings, such as IAM, encryption, and secret management.
2+ years of experience with monitoring tools (CloudWatch, Prometheus, Grafana) and implementing model observability solutions.
5+ years of effective cross-functional collaboration skills, working closely with data scientists, ML engineers, and software engineers to deliver production-grade ML solutions.
7+ years of excellent problem-solving and communication abilities, with a focus on delivering scalable, reliable, and cost-effective ML platforms.