Associate Principal - Data Engineering Job, Cincinnati area, Ohio, USA, Software Development

Job Description

Senior Developer – PySpark / Python Data Engineering

Primary Skills: PySpark/Python Developer

Location: India, Global Delivery Center Regional Hub

Industry: Multi National FMCG

Cloud Strategy: Hyperscaler‑first (Azure, GCP, AWS) with Databricks / Delta Lake

Key Responsibilities

PySpark Development (Primary Focus)
- Design and develop production‑grade PySpark applications for large‑scale batch and streaming data processing.
- Implement advanced PySpark DataFrame API operations:
  - Complex transformations – Window functions, Pivot/Unpivot and nested struct handling.
  - Multi‑dataset joins – Broadcast joins, sort‑merge joins and skew‑handling strategies.
  - Custom UDFs – User‑Defined Functions, Pandas UDFs, Vectorized UDFs for performance‑critical transformations.
  - Aggregations and Group By operations optimized for large FMCG datasets.
- Implement PySpark Structured Streaming for real‑time data processing:
  - Streaming sources – Kafka, Azure Event Hubs, GCP Pub/Sub.
  - Watermarking and windowing strategies for late‑arriving data.
  - Stateful streaming operations using mapGroupsWithState.
  - Exactly‑once and at‑least‑once delivery semantics.
- Apply advanced Spark performance tuning techniques:
  - Partition optimization – repartition vs coalesce strategies.
  - Handling data skew using salting and custom partitioners.
  - Broadcast variable management and accumulator usage.
  - Catalyst optimizer hints and Adaptive Query Execution (AQE) tuning.
  - Executor sizing, memory fractions and parallelism configuration.
- Develop and maintain reusable PySpark libraries for shared data processing capabilities.
Python Engineering (Primary Focus)
- Build Python‑based data services, automation scripts and utility frameworks supporting the data platform.
- Develop REST API integrations using Python (requests/httpx) for consuming SAP OData, Salesforce and third‑party FMCG APIs.
- Implement data validation and reconciliation frameworks using Python (Great Expectations, Pandera).
- Build Python‑based orchestration scripts and helper utilities for Airflow DAGs and Databricks Workflows.
- Apply software engineering best practices:
  - Unit testing with pytest and integration testing with Testcontainers.
  - Type hints, docstrings and modular design patterns.
  - Virtual environments, dependency management (Poetry/pip) and packaging.
- Implement Python‑based data quality checks for completeness, consistency and conformity validations.
Data Lakehouse & Cloud Platform (Primary Focus)
- Build and manage Data Lakehouse architectures on hyperscaler platforms (Azure Databricks, GCP Dataproc, AWS EMR).
- Utilize Delta Lake, Apache Iceberg, Apache Hudi for ACID‑compliant data lake storage.
- Implement Medallion Architecture (Bronze/Silver/Gold) for progressive data refinement.
- Use ACID transactions, schema enforcement, time travel, OPTIMIZE and Z‑ORDER, and Change Data Feed (CDF) for incremental data propagation.
- Manage Databricks Workflows and Job Clusters for production pipeline execution.
- Implement Databricks Auto Loader for incremental, scalable data ingestion.
- Utilize Unity Catalog for data governance, lineage and access control.
Data Ingestion & Integration
- Build data ingestion pipelines from diverse FMCG data sources.
- Sources include:
  - SAP S/4HANA – OData APIs, BAPI extracts, IDoc feeds.
  - Salesforce – REST API, Bulk API, Platform Events.
  - Operational databases – Oracle Cloud, SQL Azure, Cloud Spanner.
  - Streaming sources – Apache Kafka, Azure Event Hubs, GCP Pub/Sub.
  - File‑based sources – SFTP, Azure Blob, GCS, S3; CSV, Parquet, Avro, JSON.
- Implement Change Data Capture (CDC) patterns for real‑time database synchronization.
- Design schema evolution strategies to handle upstream data model changes gracefully.
- Publish processed data to downstream consumers – BigQuery, Azure Synapse, Snowflake, Power BI, Looker, Feature Stores (Feast, Databricks).
SQL & Data Modeling
- Write and optimize complex SQL queries for data extraction, transformation and validation.
- Design data warehouse schemas – Star and Snowflake models for FMCG analytics domains.
- Implement Spark SQL for large‑scale analytical query processing.
- Develop data quality SQL checks and reconciliation frameworks.
- Optimize SQL performance – Query plans, partition pruning, predicate pushdown.

Benefits & Perks

Comprehensive…