Big Data Engineer

Job in Jersey City, Hudson County, New Jersey, 07390, USA
Listing for: TechDigital Group
Full Time position
Listed on 2025-12-25
Job specializations:
  • IT/Tech
    Data Engineer, Big Data, Cloud Computing, Data Analyst
Job Description

Mandatory Skills:

Apache Spark, Hive, Kafka, AWS Glue, Google Dataflow, Talend MDM, Hadoop, Presto; strong experience with MySQL, PostgreSQL, MongoDB, Cassandra.

Role: Big Data Engineer

Job Overview:
We're seeking a highly skilled Big Data Engineer to build scalable data pipelines, develop ML models, and integrate big data systems. You'll work with structured, semi-structured, and unstructured data, focusing on optimizing data systems, building ETL pipelines, and deploying AI models in cloud environments.

Key Responsibilities:

  • Data Ingestion: Build scalable ETL pipelines using Apache Spark, Talend, AWS Glue, Google Dataflow, or Apache NiFi. Ingest data from APIs, file systems, and databases (see the PySpark sketch after this list).
  • Data Transformation/Validation: Use Pandas, Apache Beam, and Dask for data cleaning, transformation, and validation. Automate data quality checks with Pytest or unittest (see the Pandas/pytest sketch after this list).
  • Big Data Systems: Process large datasets with Hadoop, Kafka, Apache Flink, and Apache Hive. Stream real-time data using Kafka or Google Cloud Pub/Sub (see the Kafka sketch after this list).
  • Task Queues: Manage asynchronous processing with Celery, RQ, RabbitMQ, or Kafka. Implement retry mechanisms and track task status (see the Celery sketch after this list).
  • Scalability: Optimize for performance with distributed processing (Spark, Flink), parallelization (joblib), and data partitioning (see the joblib sketch after this list).
  • Cloud Storage: Work with AWS, Azure, GCP, and Databricks. Store and manage data with S3, BigQuery, Redshift, Synapse Analytics, and HDFS (see the boto3 sketch after this list).
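
To ground the ingestion bullet, here is a minimal PySpark ETL sketch. The bucket paths and column names (event_id, event_ts) are hypothetical, not taken from the posting.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal ETL sketch: read raw CSV events, clean them, write partitioned Parquet.
spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .csv("s3://example-bucket/raw/events/"))        # hypothetical source path

cleaned = (raw
           .dropna(subset=["event_id"])                # drop rows missing the key
           .dropDuplicates(["event_id"])               # de-duplicate on the key
           .withColumn("event_ts", F.to_timestamp("event_ts"))
           .withColumn("event_date", F.to_date("event_ts")))

(cleaned.write
 .mode("overwrite")
 .partitionBy("event_date")                            # partition for downstream reads
 .parquet("s3://example-bucket/curated/events/"))      # hypothetical sink path
```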
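
For the transformation/validation bullet, a small Pandas cleaning function paired with a pytest-style data quality check might look like this; the clean_orders function and its columns are illustrative assumptions.

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation: drop rows without a key, normalize types.
    out = df.dropna(subset=["order_id"]).copy()
    out["order_id"] = out["order_id"].astype(str).str.strip()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out.dropna(subset=["amount"])

def test_clean_orders_enforces_quality():
    # Automated data quality check, runnable with `pytest`.
    raw = pd.DataFrame({"order_id": [" a1 ", None], "amount": ["10.5", "oops"]})
    cleaned = clean_orders(raw)
    assert cleaned["amount"].notna().all()              # all surviving amounts parse
    assert (cleaned["order_id"] == cleaned["order_id"].str.strip()).all()
```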
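
For the streaming bullet, a sketch using the kafka-python client against a local broker; the topic name and event payload are made up.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Hypothetical broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "page": "/home"})
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)    # handle each event as it arrives
    break                   # stop after one event in this sketch
```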
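
For the task-queue bullet, a Celery sketch with automatic retries; the broker URL, task name, and error type are assumptions.

```python
from celery import Celery

app = Celery("pipeline", broker="amqp://guest:guest@localhost//")  # hypothetical RabbitMQ broker

@app.task(bind=True, autoretry_for=(ConnectionError,), retry_backoff=True, max_retries=5)
def load_batch(self, batch_id: str) -> str:
    # Celery retries this task automatically (with backoff) on ConnectionError.
    # ... fetch and load the batch here ...
    return f"loaded {batch_id}"

# Task status can be tracked via the AsyncResult handle:
#   result = load_batch.delay("2024-01-01")
#   result.status   # PENDING / RETRY / SUCCESS / FAILURE
```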
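
For the scalability bullet, parallelization with joblib might look like this; the transform function and chunking scheme are illustrative.

```python
from joblib import Parallel, delayed

def transform(chunk):
    # Hypothetical per-partition work.
    return sum(chunk)

chunks = [range(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]

# Fan the partitions out across local worker processes.
totals = Parallel(n_jobs=4)(delayed(transform)(c) for c in chunks)
print(totals)
```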
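
For the cloud-storage bullet, an S3 sketch with boto3; the bucket and key names are hypothetical and assume AWS credentials are already configured.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local extract to a hypothetical bucket/key.
s3.upload_file("daily_extract.parquet", "example-bucket",
               "curated/daily_extract.parquet")

# List what landed under the curated prefix.
response = s3.list_objects_v2(Bucket="example-bucket", Prefix="curated/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```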
Required Skills:

  • ETL Data Processing: Expertise in Apache Spark, AWS Glue, Google Dataflow, Talend.
  • Big Data Tools: Proficient with Hadoop, Kafka, Apache Flink, Hive, Presto.
  • Databases: Strong experience with MySQL, PostgreSQL, MongoDB, Cassandra.
  • Machine Learning: Hands-on with TensorFlow, PyTorch, Scikit-learn, XGBoost.
  • Cloud Platforms: Experience with AWS, Azure, GCP, Databricks.
  • Task Management: Familiar with Celery, RQ, RabbitMQ, Kafka.
  • Version Control: Git for source code management.
Desirable Skills:

  • Real-time Data Processing: Experience with Apache Pulsar, Google Cloud Pub/Sub.
  • Data Warehousing: Familiarity with Redshift, BigQuery, Synapse Analytics.
  • Scalability Optimization: Knowledge of load balancing (NGINX, HAProxy) and parallel processing.
  • Data Governance: Use of MLflow, DVC, or other tools for model and data versioning.
Tools & Technologies:

  • ETL: Apache Spark, Talend, AWS Glue, Google Dataflow.
  • Big Data: Hadoop, Kafka, Apache Flink, Presto.
  • Databases: MySQL, PostgreSQL, MongoDB, Cassandra.
  • Cloud: AWS, GCP, Azure, Databricks.
  • Storage: S3, BigQuery, Redshift, Synapse Analytics, HDFS.
  • Version Control: Git.