Our client is a leader in the AgTech space, advancing farming globally through cutting-edge technology. They are looking to add a Principal Data Engineer to their team:
Evaluate: Understand the functions and features of existing services and identify areas for improvement in scalability and performance tuning.
Build: Based on the findings, build a solution that showcases the performance benefits of the new architecture, including optimized use of S3 for data storage and retrieval. The pipeline should be defined as a Terraform workflow so it is reproducible for new applications.
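The "optimized use of S3" part of this duty often comes down to key layout: spreading objects across date/device prefixes lets downstream jobs list and fetch only the partitions they need. A minimal sketch of such a scheme is below; the prefix layout, application name, and device ID format are illustrative assumptions, not the client's actual conventions:

```python
from datetime import date

def partitioned_key(app: str, device_id: str, capture_date: date, filename: str) -> str:
    """Build a date-partitioned S3 object key.

    Partitioning by date and device lets retrieval jobs restrict their
    scans to a prefix instead of the whole bucket. (This particular
    layout is an illustrative assumption.)
    """
    return (
        f"{app}/date={capture_date.isoformat()}/"
        f"device={device_id}/{filename}"
    )

# Example: a frame captured by an edge device on 2024-05-01
key = partitioned_key("crop-vision", "cam-0042", date(2024, 5, 1), "frame_000123.jpg")
# key == "crop-vision/date=2024-05-01/device=cam-0042/frame_000123.jpg"
```

In a real deployment the same key scheme would be referenced both by the Terraform-provisioned pipeline and by the retrieval code, so the layout stays consistent across applications.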
Manage: Monitor the life cycle of active AWS services such as Lambda, EC2, and S3.
2. Data Processing
Define common data standards for consumption (and possibly upload) across applications: For scalable data processing, design and build a solution capable of consuming data from various edge devices and storing it in S3. The solution should include data standardization, accepting data in any format and converting it into a standardized format to reduce the consumption overhead for AI applications during training.
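The standardization step above might be sketched as a normalizer that maps heterogeneous device payloads onto one canonical schema. The field names, alias table, and target schema below are assumptions for illustration only; the real ones would come from the agreed data standard:

```python
from typing import Any

# Illustrative canonical schema (assumed, not the client's actual standard).
CANONICAL_FIELDS = ("device_id", "timestamp", "payload")

# Map known per-device field aliases onto canonical names (assumed aliases).
ALIASES = {
    "dev": "device_id", "deviceId": "device_id",
    "ts": "timestamp", "time": "timestamp",
    "data": "payload", "body": "payload",
}

def standardize(record: dict) -> dict:
    """Rename aliased fields and drop anything outside the canonical schema."""
    out: dict[str, Any] = {}
    for key, value in record.items():
        canonical = ALIASES.get(key, key)
        if canonical in CANONICAL_FIELDS:
            out[canonical] = value
    # Fill missing canonical fields with None so consumers see a fixed shape.
    for field in CANONICAL_FIELDS:
        out.setdefault(field, None)
    return out

record = standardize({"dev": "cam-7", "ts": "2024-05-01T12:00:00Z", "body": b"..."})
# record == {"device_id": "cam-7", "timestamp": "2024-05-01T12:00:00Z", "payload": b"..."}
```

Keeping the alias table as data (rather than per-device code paths) makes onboarding a new device format a configuration change instead of a code change.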
Raw data to cleaned/processed data consumable directly by AI engineers: This involves a one-time cleanup of data dating back to 2018, developing scripts to fix duplicate label class names, delete black/green/corrupt frames, and rename images captured before 2020 to conform to the current image format, as well as establishing an automated pipeline that runs the same data checks on incoming data from the field.
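The cleanup steps listed above could be sketched roughly as below. The class-name merge map, the uniform-frame heuristic, and the filename convention are all hypothetical placeholders; a real pipeline would read actual image data (e.g. with Pillow) rather than flat pixel lists:

```python
import re

# Hypothetical merge map for duplicate label class names (e.g. casing or
# pluralization variants found during the 2018+ backfill).
CLASS_MERGES = {"Weed": "weed", "weeds": "weed", "Crop": "crop"}

def fix_class_name(name: str) -> str:
    """Collapse known duplicate class names onto one canonical name."""
    return CLASS_MERGES.get(name, name)

def is_uniform_frame(pixels: list, tolerance: int = 2) -> bool:
    """Flag all-black (or otherwise near-uniform) frames for deletion.

    `pixels` is a flat list of grayscale values; this stands in for a
    real image decode step.
    """
    return max(pixels) - min(pixels) <= tolerance

def conform_filename(old: str) -> str:
    """Rename pre-2020 files like 'IMG 123.JPG' to an assumed
    'img_000123.jpg' convention; leave already-conforming names alone."""
    m = re.match(r"IMG[ _-]?(\d+)\.(jpe?g)$", old, re.IGNORECASE)
    if not m:
        return old
    return f"img_{int(m.group(1)):06d}.jpg"
```

The same three checks, wired into the ingestion path, give the automated pipeline for incoming field data that the duty calls for.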
3. Manage Data Storage, Retrieval/Curation
Build/Manage a database of raw images as well as labeled images for several applications
Build a tool chain for querying and visualizing data diversity
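A minimal sketch of the diversity-querying idea: tally label-class frequencies across a set of annotations, which a visualization layer could then chart as a histogram. The annotation record format here is an assumption made for illustration:

```python
from collections import Counter

# Assumed annotation format: each record lists the label classes
# present in one image.
annotations = [
    {"image": "a.jpg", "classes": ["weed", "crop"]},
    {"image": "b.jpg", "classes": ["weed"]},
    {"image": "c.jpg", "classes": ["crop", "crop"]},
]

def class_distribution(records) -> Counter:
    """Count how often each label class appears: a first proxy for
    dataset diversity and class imbalance."""
    counts = Counter()
    for rec in records:
        counts.update(rec["classes"])
    return counts

dist = class_distribution(annotations)
# dist["crop"] == 3, dist["weed"] == 2
```

Against a real labeled-image database this would be a grouped count query; the in-memory version just shows the shape of the answer the tool chain needs to produce.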