Tech Lead Data Scientist, AI Evaluation & Monitoring
Listed on 2026-05-06
IT/Tech
AI Engineer (Applied/Software)
Job Summary
The Tech Lead Data Scientist, AI Evaluation & Monitoring is the principal technical expert for how Geisinger evaluates, monitors, and optimizes AI systems in production. This hands‑on technical leadership role sets the technical direction for AI evaluation across a large portfolio, provides leadership to a team of data analysts, and partners directly with AI program teams to raise the quality of AI validation, monitoring, and improvement.
Job Duties
- The technical evaluation methodology applied to AI programs across the enterprise – pre‑production validation, production monitoring, and ongoing optimization.
- Hands‑on guidance to program teams as they design validation studies, equity audits, monitoring plans, and escalation playbooks.
- Instrumentation of production monitoring: translating program‑specific failure modes into concrete, measurable metrics.
- The evaluation toolkit: LLM-as-Judge frameworks, golden sets, simulation harnesses, experimental study designs, drift detection, subgroup fairness analysis.
- Reusable evaluation playbooks and templates that let each new program move faster.
- Technical direction, design review, and mentorship for a team of data analysts supporting the evaluation function.
Outside the Scope of This Role
- People management, HR administration, or formal performance evaluations for the analyst team.
- Program‑level product strategy or go/no‑go decisions.
- Final clinical validation judgment on whether a given AI is safe for a specific clinical use.
- The software infrastructure behind the evaluation and monitoring tooling.
With program teams (hands‑on advisory). Partner with program owners early to shape study approach, sample size, stratification, gold‑standard definition, and decision thresholds. Translate ambiguous failure modes into concrete, defensible evaluation designs and coach teams through technical work.
With the evaluation toolkit (hands‑on build). Design and operate reusable assets that let evaluation scale: LLM‑as‑Judge rubrics and calibration methods, golden sets, simulation harnesses, A/B and shadow‑mode study templates, subgroup fairness analyses, and drift monitors.
With the analyst team (technical leadership). Set technical direction, assign work across active evaluations, review analysis code and study designs, and raise the technical bar. Mentor analysts on methodology, statistical rigor, and domain knowledge.
Methods You'll Use
- Experimental and quasi‑experimental design for production AI systems.
- LLM and generative AI evaluation: golden sets, judge‑based evaluation, hallucination and grounding checks.
- Fairness and equity evaluation across patient and stakeholder subgroups.
- Production monitoring design: drift detection, performance decay, adoption, and outcome metrics.
- Causal inference methods appropriate to healthcare settings where full RCTs are impractical.
- Simulation and adversarial testing for pre‑production stress testing.
- Python, SQL, modern ML and evaluation tooling, cloud‑native data platforms.
Work is typically performed in an office or remote environment and requires compliance with all organization policies and procedures.
Required Skills & Qualifications
- 6+ years in data science, statistics, ML engineering, or applied quantitative research, with a senior technical voice on cross‑functional projects.
- Strong foundation in experimental design and causal inference.
- Hands‑on experience designing and running model evaluation studies in real production settings.
- Experience evaluating LLM or generative AI systems, or comparable complex ML systems where ground truth is messy.
- Proven ability to translate ambiguous failure modes into concrete, defensible evaluation designs and monitoring metrics.
- Strong fluency in Python and SQL; comfort with modern ML tooling and cloud‑native data environments.
- Experience with fairness and equity evaluation for ML systems.
- Track record of providing technical leadership and mentorship without formal people‑management authority.
- Clear written communication – produces evaluation memos and specifications relied upon by non‑technical decision‑makers.
- Healthcare, clinical, or regulated‑industry experience strongly preferred.
- MS or PhD in a quantitative field preferred; equivalent experience accepted.
Bachelor's Degree – Related Field of Study (Required)
Experience
Minimum of 6 years – Relevant experience (Required)
We are proud to be an affirmative action, equal‑opportunity employer, and all qualified applicants will receive consideration for employment regardless of race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status.