Home Base seeks to hire a data engineer, starting in April 2021, to support a variety of database management, ETL scripting, and data validation tasks. These include, but are not limited to: querying databases, restructuring data, cleaning and validating data, performing manual ETL tasks, automating ETL tasks using tools and custom scripting, managing and monitoring full pipelines, improving systems and processes, and documenting data systems. The qualified candidate will be highly detail-oriented and have a strong interest in and aptitude for data management and engineering. Some specific focus areas will be determined based on the candidate's skills and interests.
The successful candidate must be highly organized, motivated, and able to thrive in a fast-paced team environment and must enjoy the challenge of a dynamic environment with evolving needs. It is extremely important that the candidate possess the ability to carefully keep track of multiple work streams.
Relevant activities include, but are not limited to, the following:
- Achieving an extremely detailed understanding of our current data ecosystem, including its structure, data meaning, history, flow/processing, and challenges
- Utilizing, improving, and constructing ETL tools
  - Running current SQL, Python, PHP, and/or Tableau Prep ETL scripts
  - Using various monitoring and evaluation methods to validate that data flowing through these pipelines is accurate, and troubleshooting and addressing issues when they are discovered
  - Improving the ETL and validation scripts and integrating them further into various data pipelines to achieve greater efficiency, reliability, and functionality
  - Constructing new ETL tools as necessary, including a major rewrite of a family of old PHP pipelines in Python
- Data Cleaning
  - Writing queries and scripts to identify data quality problems
  - Investigating the root cause of data quality problems
  - Working with appropriate team members to determine data remediation and process improvement plans
  - Developing queries and scripts as needed to repair data in bulk
  - Supporting a dashboard that automatically monitors for certain critical data quality problems in production, independent of ETL processes
- Additional Responsibilities
  - Supporting the team as needed with data querying, processing, analysis, and reporting for both regular and ad-hoc requests from clinical, executive, and external audiences
  - Researching potential new data engineering solutions, analyzing feasibility, and assisting technical leadership in road-mapping the evolution of our data infrastructure
  - Creating and maintaining documentation across our data ecosystem
Qualifications and skills include the following:
- Degree in Health Informatics, Computer Science, Statistics, Mathematics, Engineering, or a similar field
- Familiarity with behavioral health clinical practice and/or research preferred
- Procedural programming for data manipulation using Python, NumPy, and Pandas
- PHP, Java, or other languages are a plus
- Knowledge of relational database platforms and data modeling
- Comfortable extracting data from and loading data into sources ranging from an Enterprise Data Warehouse to an Excel or text file, using built-in tools or custom-written ETL scripts
- Knowledge of data aggregation and transformation processes (e.g. pivot, merge, union, hierarchical grouping, aggregation functions)
- Above-average SQL skills (e.g. familiarity with subqueries, multiple joins, and grouping), specifically in MySQL; SQL Server experience is a plus
- Comfortable with complex multi-stage, multi-technology ETL pipelines
- Comfortable using APIs to transmit data in both an ad-hoc and automated manner
- Familiar with concepts/tools of Data Quality Management as well as Data Governance practices
- Ability to interpret and follow through on data requirements with strong attention to detail
- Strength in independently validating and debugging code and analyses, including consulting documentation, Stack Exchange, etc.
- Demonstrates personal initiative and time management skills, as well as the ability to work effectively and kindly as part of a team
- Excellent verbal and written communication skills
- Familiar with agile software development methodologies
- Interest in identifying process improvement opportunities is a plus
LICENSES, CERTIFICATIONS, and/or REGISTRATIONS:
- Required: Undergraduate degree in Health Informatics, Computer Science, Statistics, Mathematics, Engineering, or a related subject.
- Preferred: Graduate degree in one of the above.
Preferred coursework includes most of the following:
- Intermediate Databases and SQL
- Intermediate Programming (Procedural and/or OO)
- Data Structures and Algorithms
- Data Quality Management
- Data Flow and Automation
- Agile Project Management
Equivalent Experience – Equivalent time and aptitude achieved through work experience may substitute for some of the preferred courses listed above.
EXPERIENCE:
Preferred: 2+ years of experience in data management in a healthcare/clinical setting; however, recent or anticipated college graduates will be considered.
WORKING CONDITIONS:
100% remote through Aug 31, 2021; up to 100% remote afterwards, TBD.
Massachusetts General Hospital is an Equal Opportunity Employer. By embracing diverse skills, perspectives and ideas, we choose to lead. Applications from protected veterans and individuals with disabilities are strongly encouraged.