We are looking for expertise in real-time data warehousing and in building large-scale batch and streaming data processing systems using a modern data stack.
The Data Engineer is responsible for building and running data pipelines, designing our data warehouse/lake and data frameworks, and supporting different data presentation techniques.
Responsibilities
- Coder at heart: someone who can define, write, deploy, and maintain a codebase independently.
- Define, execute, and manage large-scale ETL processes to build the data lake and data warehouse and to support development.
- Demonstrate strong knowledge of building real-time data analytics pipelines using Kafka, Druid, and Airflow.
- Bring strong problem-solving capabilities and the ability to quickly propose feasible solutions and to communicate strategy and risk-mitigation approaches effectively to leadership.
- Build ETL pipelines in Spark and Presto that process transaction- and account-level data and standardize data fields across various data sources (a minimal orchestration sketch follows this list).
- Experience creating/supporting production software/systems and a proven track record of identifying and resolving performance bottlenecks for production systems.
- Exposure to deploying large data pipelines to scale ML/AI models built by the data science teams and experience with the development of models is a strong plus.
- Build and maintain high-performing ETL processes, including data quality and testing aligned across technology, internal reporting, and other functional teams.
- Create data dictionaries, set and monitor data validation alerts, and execute periodic jobs such as performance dashboards and predictive model scoring for client deliverables.
- Define and build technical/data documentation, working with code version control systems (e.g., Git).
- Ensure data accuracy, integrity, and consistency.
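As a concrete illustration of the Airflow-orchestrated Spark ETL work listed above, here is a minimal sketch, assuming Airflow 2.4+ with the Apache Spark provider installed; the DAG name, schedule, application path, and connection ID are hypothetical placeholders, not details of our actual pipelines.

```python
# Hypothetical minimal Airflow DAG: schedules a daily Spark ETL job that
# standardizes transaction-level data. Paths and IDs are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_transaction_etl",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                   # Airflow 2.4+ syntax
    catchup=False,
) as dag:
    standardize_transactions = SparkSubmitOperator(
        task_id="standardize_transactions",
        application="/opt/jobs/standardize_transactions.py",  # placeholder Spark script
        conn_id="spark_default",
    )
```

In practice, a DAG like this would chain additional tasks (data-quality checks, warehouse loads, alerting) after the Spark submit step.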
Requirements
- Strong understanding of development and implementation aspects of data pipelines for ML/AI, especially on billion-scale datasets.
- Ability to take models developed at small scale as input and implement them with the requisite configuration and customization while maintaining model performance.
- Strong written, verbal, and interpersonal skills are needed to effectively communicate technical insights and recommendations with business customers and the leadership team.
- Exposure to model management and governance practices.
- Ability to make decisions around model drift and to monitor and refine models continuously.
- 4+ years of work experience with a Bachelor's degree, or 3+ years of work experience with a Master's or advanced degree, in an analytical field such as computer science, statistics, finance, economics, or a related area.
- Strong experience in creating large-scale data engineering frameworks/pipelines, data-based decision-making, and quantitative analysis.
- Strong experience with batch and real-time data management.
- Advanced experience in writing and optimizing efficient SQL queries and handling large datasets in big-data environments (see the illustrative sketch after this list).
- Experience with shell and Python scripting and exposure to scheduling tools such as Apache Airflow.
- Advanced knowledge of big-data ecosystems and associated technologies (e.g., Apache Spark, EMR, EKS, Redshift, Athena/Presto, Kafka, Airflow) is a must.
- Languages preferred - Python, Scala.
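To give a flavor of the SQL-on-large-datasets work referenced above, here is a hedged PySpark sketch; the storage paths, column names, and partition filter are assumptions for illustration only.

```python
# Hypothetical PySpark job: aggregates transaction data into daily totals,
# reading only the partitions needed rather than scanning the full table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_transaction_rollup").getOrCreate()

transactions = (
    spark.read.parquet("s3://data-lake/transactions/")  # placeholder path
    .where(F.col("txn_date") >= "2024-01-01")           # partition pruning
)

daily_totals = (
    transactions
    .groupBy("account_id", "txn_date")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("txn_count"),
    )
)

daily_totals.write.mode("overwrite").parquet("s3://data-warehouse/daily_totals/")  # placeholder path
```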
Additional/Preferred Qualifications
- Experience building a data lake/lakehouse from scratch is preferred.
- Experience maintaining and optimizing infrastructure for BI tools such as Superset or Metabase is a plus.
- Experience developing and maintaining backend REST APIs for data platforms using Java or Go is a strong plus.
- Experience with a semantic layer around the data platform is a strong plus.
Important Traits We Look for in This Role
- Independent, resourceful, analytical, and able to solve problems effectively.
- Ability to be flexible and agile, and to thrive in chaos.
- Excellent oral and written communication skills.
This job was posted by Nirvesh Mehrotra from GoKwik.