Senior Workflow Orchestration Engineer; Airflow & Scheduling Platforms
Job in New York, New York County, New York, 10261, USA
Listed on 2026-03-07
Listing for: Benton Partners
Full Time
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, Data Engineer
Job Description & How to Apply Below
Location: New York
About the role
We're seeking a seasoned engineer to design, operate, and scale our workflow orchestration platform with a primary focus on Apache Airflow.
You'll own the Airflow control plane and developer experience end-to-end (architecture, automation, security, observability, and reliability) while also evaluating and operating complementary schedulers where appropriate.
You'll build automation infrastructure and partner across data, trading, and engineering teams to deliver mission-critical pipelines at scale.
Responsibilities:
- Architect, deploy, and operate production-grade Airflow on Kubernetes, including all components and user application dependencies, with a focus on upgrades, capacity planning, HA, security, and performance tuning
- Operate a multi-scheduler ecosystem: determine when to use Airflow, distributed compute schedulers, or lightweight task runners based on workload requirements; provide a unified developer experience across schedulers
- Build automation infrastructure: Terraform modules and Helm charts with GitOps-driven CI/CD for environment provisioning, upgrades, and zero-downtime rollouts
- Standardize the developer experience: DAG repo templates, shared operator libraries, connection and secrets management, dependency packaging, code ownership, linting, unit testing, and pre-commit hooks
- Implement comprehensive observability: metrics collection, dashboards, distributed tracing, SLA/latency monitoring, intelligent alerting, and runbook automation
- Enable resilient workflow patterns: build idempotency frameworks, retry/backoff strategies, deferrable operators and sensors, dynamic task mapping, and data-aware scheduling
- Ensure reliability at enterprise scale: architect and tune resource allocation (pools, queues, concurrency limits) to support high-throughput workloads; optimize large-scale backfill strategies; develop comprehensive runbooks and lead incident response/postmortems
- Partner with teams across the organization to provide enablement, documentation, and self-service tooling
- Mentor engineers, contribute to platform roadmap and technical standards, and drive engineering best practices
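The resilient workflow patterns among the responsibilities above (retry/backoff strategies, idempotent reruns) rest on a pattern that is easy to state in plain Python. Below is a minimal sketch of exponential backoff with full jitter; the function name and defaults are illustrative, not from any particular scheduler or library:

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Yield a sleep time per retry attempt using "full jitter":
    delay = uniform(0, min(cap, base * 2**attempt))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling

# The deterministic upper bounds double per attempt until hitting the cap:
bounds = [min(60.0, 2.0 ** a) for a in range(5)]
print(bounds)  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Jitter spreads simultaneous retries apart so that a downstream outage does not produce a thundering herd when hundreds of tasks recover at once; pairing it with idempotent task design is what makes reruns and backfills safe.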
Qualifications:
- 5–8+ years building/operating data or platform systems; 3+ years running Airflow in production at scale (hundreds to thousands of DAGs and high task throughput).
- Deep Airflow expertise: DAG design and testing, idempotency, deferrable operators/sensors, dynamic task mapping, task groups, datasets, pools/queues, SLAs, retries/backfills, cross-DAG dependencies.
- Strong Kubernetes experience running Airflow and supporting services: Helm, autoscaling, node/pod tuning, topology spread, network policies, PDBs, and blue/green or canary strategies.
- Automation-first mindset: Terraform, Helm, GitOps (Argo CD/Flux), and CI/CD for the platform lifecycle; policy-as-code (OPA/Gatekeeper/Conftest) for DAG, connection, and secrets changes.
- Proficiency in Python for authoring operators/hooks/utilities; solid Bash; familiarity with Go or Java is a plus.
- Observability and SRE practices: Prometheus/Grafana/StatsD, centralized logging, alert design, capacity/throughput modeling, performance tuning.
- Data platform experience with at least one major cloud (AWS/Azure/GCP) and systems like Snowflake/BigQuery/Redshift, Databricks/Spark, EMR/Dataproc; strong grasp of IAM, VPC networking, and storage (S3/GCS/ADLS).
- Security/compliance: SSO/OIDC, RBAC, secrets management (Vault/Secrets Manager), auditing, least-privilege connection management, and change control.
- Proven incident leadership, runbook creation, and platform roadmap execution; excellent cross-functional communication.
Nice to have:
- Experience operating alternative orchestrators (Prefect 2.x, Dagster, Argo Workflows, AWS Step Functions) and leading migrations to/from Airflow.
- OpenLineage/Marquez adoption; Great Expectations or other data quality frameworks; data contracts.
- dbt Core/Cloud orchestration patterns (state management, artifacts, slim CI).
- Cost optimization and capacity planning for schedulers and workers; spot instance strategies.
- Multi-region HA/DR for Airflow metadata DB; backup/restore and disaster drills.
- Building internal developer platforms/portals (e.g., Backstage) for self-service pipelines.
- Contributions to Apache Airflow or provider packages; familiarity with recent AIPs/Airflow 2.7+ features.
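The capacity/throughput modeling called out above usually starts from Little's Law: steady-state concurrency ≈ task arrival rate × average task duration. A back-of-the-envelope sketch; the function name and the 25% headroom factor are illustrative assumptions, not figures from this posting:

```python
import math

def required_slots(tasks_per_minute: float, avg_task_minutes: float,
                   headroom: float = 1.25) -> int:
    """Little's Law: concurrency = throughput x latency, padded with
    headroom for retries, bursty cron schedules, and backfills."""
    return math.ceil(tasks_per_minute * avg_task_minutes * headroom)

# 600 tasks/min averaging 2 min each -> 1200 concurrent, 1500 with 25% headroom
print(required_slots(600, 2.0))  # 1500
```

The resulting number feeds directly into worker concurrency settings, pool sizes, and Kubernetes autoscaler targets for the scheduler fleet.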
Position Requirements
10+ years work experience