Associate Principal - Data Engineering

Job in Cincinnati, Hamilton County, Ohio, 45208, USA
Listing for: LTM
Full Time position
Listed on 2026-05-22
Job specializations:
  • Software Development
    Data Engineer
Salary/Wage Range or Industry Benchmark: 60000 - 80000 USD Yearly
Job Description & How to Apply Below

Job Description

Senior Developer – PySpark / Python Data Engineering

Primary Skills: PySpark/Python Developer

Location: India, Global Delivery Center Regional Hub

Industry: Multi National FMCG

Cloud Strategy: Hyperscaler First (Azure, GCP, AWS) with Databricks and Delta Lake

Key Responsibilities
  • PySpark Development (Primary Focus)
    • Design and develop production‑grade PySpark applications for large‑scale batch and streaming data processing.
    • Implement advanced PySpark DataFrame API operations:
      • Complex transformations – Window functions, Pivot/Unpivot and nested struct handling.
      • Multi-dataset joins – Broadcast joins, Sort-Merge joins and skew‑handling strategies.
      • Custom UDFs – User‑Defined Functions, Pandas UDFs, Vectorized UDFs for performance‑critical transformations.
      • Aggregations and Group By operations optimized for large FMCG datasets.
    • Implement PySpark Structured Streaming for real-time data processing:
      • Streaming sources – Kafka, Azure Event Hubs, GCP Pub/Sub.
      • Watermarking and windowing strategies for late‑arriving data.
      • Stateful streaming operations using mapGroupsWithState.
      • Exactly‑once and at‑least‑once delivery semantics.
    • Apply advanced Spark performance tuning techniques:
      • Partition optimization – repartition vs coalesce strategies.
      • Handling data skew using salting and custom partitioners.
      • Broadcast variable management and accumulator usage.
      • Catalyst optimizer hints and Adaptive Query Execution (AQE) tuning.
      • Executor sizing, memory fractions and parallelism configuration.
    • Develop and maintain reusable PySpark libraries for shared data processing capabilities.
  • Python Engineering (Primary Focus)
    • Build Python‑based data services, automation scripts and utility frameworks supporting the data platform.
    • Develop REST API integrations using Python requests/httpx for consuming SAP OData, Salesforce and third‑party FMCG APIs.
    • Implement data validation and reconciliation frameworks using Python (Great Expectations, Pandera).
    • Build Python‑based orchestration scripts and helper utilities for Airflow DAGs and Databricks Workflows.
    • Apply software engineering best practices:
      • Unit testing with pytest and integration testing with Testcontainers.
      • Type hints, docstrings and modular design patterns.
      • Virtual environments, dependency management (Poetry/pip) and packaging.
    • Implement Python‑based data quality checks for completeness, consistency and conformity validations.
  • Data Lakehouse & Cloud Platform (Primary Focus)
    • Build and manage Data Lakehouse architectures on hyperscaler platforms (Azure Databricks, GCP Dataproc, AWS EMR).
    • Utilize Delta Lake, Apache Iceberg, Apache Hudi for ACID‑compliant data lake storage.
    • Implement Medallion Architecture – Bronze/Silver/Gold for progressive data refinement.
    • Use ACID transactions, schema enforcement, time travel, OPTIMIZE and Z-ORDER, and Change Data Feed (CDF) for incremental data propagation.
    • Manage Databricks Workflows and Job Clusters for production pipeline execution.
    • Implement Databricks Auto Loader for incremental, scalable data ingestion.
    • Utilize Unity Catalog for data governance, lineage and access control.
  • Data Ingestion & Integration
    • Build data ingestion pipelines from diverse FMCG data sources.
    • Sources include:
      • SAP S/4HANA – OData APIs, BAPI extracts, IDoc feeds.
      • Salesforce – REST API, Bulk API, Platform Events.
      • Operational databases – Oracle Cloud, SQL Azure, Cloud Spanner.
      • Streaming sources – Apache Kafka, Azure Event Hubs, GCP Pub/Sub.
      • File‑based sources – SFTP, Azure Blob, GCS, S3; CSV, Parquet, Avro, JSON.
    • Implement Change Data Capture (CDC) patterns for real-time database synchronization.
    • Design schema evolution strategies to handle upstream data model changes gracefully.
    • Publish processed data to downstream consumers – BigQuery, Azure Synapse, Snowflake, Power BI, Looker, Feature Stores (Feast, Databricks).
  • SQL & Data Modeling
    • Write and optimize complex SQL queries for data extraction, transformation and validation.
    • Design data warehouse schemas – Star and Snowflake models for FMCG analytics domains.
    • Implement Spark SQL for large‑scale analytical query processing.
    • Develop data quality SQL checks and reconciliation frameworks.
    • Optimize SQL performance – Query plans, partition pruning, predicate pushdown.
Benefits & Perks
  • Comprehensive…
Position Requirements
10+ Years work experience