Lead AI Engineer - SRE, LLM Agents, Full-Stack Architecture Job, Saskatoon area, Saskatchewan, Canada, IT/Tech

About the Opportunity

A leading financial institution is seeking a highly experienced Lead AI Engineer to join its advanced technology division. This is a high-impact, leadership-track role at the intersection of AI engineering, Site Reliability, and enterprise-grade software architecture. The successful candidate will design, build, and operationalize the next generation of agentic AI systems within a regulated banking environment — driving intelligent automation while maintaining the rigorous security, compliance, and availability standards demanded by the financial services industry.

You will architect multi-agent LLM systems, implement Model Context Protocol (MCP) servers, build production-grade RAG pipelines, and lead AI observability practices using the ELK stack. This role requires deep technical expertise combined with the leadership acumen to mentor engineers and influence cross-functional technical decisions.

Pillar 1 — AI Architecture & Agentic Systems

Design and implement sophisticated LLM-powered agentic workflows and multi-agent architectures capable of autonomous reasoning, planning, and tool execution within secure financial system boundaries.
Architect and deploy scalable Model Context Protocol (MCP) servers to enable standardized, secure, and rich context management between AI models, internal banking APIs, and external data sources.
Develop production-grade Retrieval-Augmented Generation (RAG) and GraphRAG pipelines that ground AI agents in accurate, real-time enterprise financial data with full auditability.
Leverage expertise in Meta AI (Llama ecosystem), Google AI (Gemini, Vertex AI), and Microsoft Copilot to build and integrate cutting-edge AI features while adhering to financial data handling policies.
Implement prompt versioning, model drift detection, and automated evaluation pipelines to maintain AI system quality and regulatory compliance over time.

Pillar 2 — Full-Stack Engineering

Lead end-to-end development of robust, scalable AI applications using Node.js (TypeScript) and Python (FastAPI/Django); both languages are required.
Champion AI-assisted developer workflows ("Vibe Coding") using advanced tools such as Cursor and GitHub Copilot to improve team productivity and code quality.
Design and implement secure, high-performance RESTful and GraphQL APIs to serve LLM inference results and agentic actions to frontend and downstream systems.
Develop and maintain Bash and Python automation scripts for infrastructure management, deployment orchestration, and operational efficiency.
Mentor junior and mid-level engineers in AI-native development practices and modern architectural patterns.

Pillar 3 — Site Reliability Engineering & AI Observability

Implement comprehensive observability stacks using the ELK Stack (Elasticsearch, Logstash, Kibana) specifically tuned for LLM performance metrics: latency, token usage, hallucination rates, and model drift indicators.
Apply SRE best practices to AI workloads, including high availability, fault tolerance, incident response playbooks, and SLO/SLA management for LLM inference services.
Build and maintain CI/CD pipelines tailored for machine learning models, including prompt versioning, model evaluation gates, shadow deployments, and automated rollback.
Design alerting, on-call runbooks, and escalation paths for AI system incidents within a regulated financial services environment.

Required Qualifications

AI & Machine Learning - Deep understanding of LLM architectures, prompt engineering, fine-tuning techniques (LoRA/QLoRA), and embedding models. Proven experience building and operating production-grade LLM applications.
Agentic Frameworks - Hands-on experience designing autonomous agents and implementing Model Context Protocol (MCP) servers for standardized tool and context management.
RAG & Vector Databases - Strong experience building RAG and GraphRAG pipelines. Proficiency with vector databases (Pinecone, Milvus, or Weaviate) and embedding model selection strategies.
Observability & SRE - Extensive hands-on experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for distributed system logging, monitoring, and AI-specific metrics tracking.
Cloud & Infrastructure - Proven experience with cloud-native architectures. Azure and AKS (Azure Kubernetes Service) experience strongly preferred for this engagement.
Enterprise AI Tools - Demonstrated expertise with Microsoft Copilot (Copilot Studio extensibility, custom connectors), Meta AI open-source models, and Google AI infrastructure (Gemini/Vertex AI).
Leadership - 8+ years of progressive software engineering experience. Minimum 3 years in a technical leadership or architectural role with a focus on AI/ML systems.

Banking & Compliance Requirements

Given the regulated nature of this environment, candidates must demonstrate awareness of and experience with the following:

Working knowledge of SOC 2 Type II compliance principles and their impact on AI system design and data handling.
Understanding of financial data…
