Simplified Data Quality Monitoring of Dynamic Longitudinal Data: A Functional Programming Approach

rstudio::conf 2020 programming

Jacqueline Gutman

January 30, 2020

Ensuring the quality of data we deliver to customers or provide as inputs to models is often one of the most under-appreciated and yet time-consuming responsibilities of a modern data scientist. This task is challenging enough when working with static data, but when we have access to dynamic, longitudinal, continuously updating data, that complexity can become an asset. We will demonstrate how to simplify data quality monitoring of dynamic data with a functional programming approach that enables early and actionable detection of data quality concerns.

Using purrr as well as tidyr and nested tibbles, we will illustrate the five key pillars of enjoyable, user-friendly data quality monitoring with relevant R code: Readability, Reproducibility, Efficiency, Robustness, and Compositionality.

Readability: FP empowers us to abstract away from the mechanics and implementation of comparing two or more related datasets and move towards declaring the intent of features and metrics we want to compare.
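As a rough illustration of declaring intent rather than mechanics, the sketch below lists the metrics to compare as named functions and applies them to each snapshot with purrr. The snapshot data, the metric names, and the object names (`snapshots`, `metrics`, `summary_mat`) are all hypothetical and are not taken from the talk.

```r
library(purrr)

# Hypothetical snapshots of the same dataset at two points in time
snapshots <- list(
  jan = data.frame(age = c(34, 51, NA, 62), weight = c(70, 82, 65, 90)),
  feb = data.frame(
    age = c(34, 51, 47, 62, NA, NA),
    weight = c(70, 82, 65, 90, 77, 58)
  )
)

# Declare what we want to compare, not how to loop over the data
metrics <- list(
  n_rows = function(df) nrow(df),
  pct_missing_age = function(df) mean(is.na(df$age)),
  mean_weight = function(df) mean(df$weight)
)

# Apply every metric to every snapshot
results <- map(snapshots, function(df) map_dbl(metrics, function(f) f(df)))

# One row per snapshot, one column per metric
summary_mat <- do.call(rbind, results)
print(summary_mat)
```

Adding a new check under this layout means adding one entry to `metrics`; the code that iterates over snapshots does not change.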

Reproducibility: By avoiding side-effects and dependencies on external states and inputs, and using functional units which can be easily tested over a variety of inputs, FP reduces the burden of creating reproducible code. Perhaps more importantly, FP supports not just reproducibility of results, but reproducibility of workflows that can be continually applied to dynamic datasets.

Efficiency: FP enables more efficient code through lazy evaluation, caching, and simplified implementation over parallel backends.

Robustness: FP allows greater testability of our code through modularization and elegant error-handling, with customized fail-safes for data that differs in expected ways over time.
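One way such a fail-safe might look, assuming purrr's `possibly()` as the error-handling tool, is sketched below. The metric `mean_weight`, the renamed column, and the snapshot data are hypothetical and chosen only to trigger a failure.

```r
library(purrr)

# Hypothetical metric that fails loudly when an expected column is absent
mean_weight <- function(df) {
  stopifnot("weight" %in% names(df))
  mean(df$weight)
}

# Wrap the metric so that a failure yields NA instead of stopping the run
safe_mean_weight <- possibly(mean_weight, otherwise = NA_real_)

snapshots <- list(
  jan = data.frame(weight = c(70, 82)),
  feb = data.frame(wt = c(70, 82)) # column renamed upstream (hypothetical)
)

# The monitoring run completes; the problem snapshot is flagged with NA
res <- map_dbl(snapshots, safe_mean_weight)
print(res)
```

The NA marks the snapshot that needs attention, while the metrics for the remaining snapshots are still computed.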

Compositionality: FP encourages higher-level reasoning with functions, which in turn drives both readability (through higher-level, more abstract code) and robustness (through modifying function behavior when errors are encountered).
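A minimal sketch of this idea, assuming purrr's `compose()` and `safely()`: two small steps are combined into one check, and the combined function is then modified to capture errors rather than raise them. The helper names (`drop_incomplete`, `count_rows`, `n_complete`) are hypothetical.

```r
library(purrr)

# Two small, independently testable steps (hypothetical helpers)
drop_incomplete <- function(df) {
  stopifnot(is.data.frame(df))
  df[stats::complete.cases(df), , drop = FALSE]
}
count_rows <- function(df) nrow(df)

# Compose the steps left to right into a single check
n_complete <- compose(drop_incomplete, count_rows, .dir = "forward")

# Modify the composed function so errors are captured, not raised
safe_n_complete <- safely(n_complete, otherwise = NA_integer_)

ok <- safe_n_complete(data.frame(x = c(1, NA, 3)))
bad <- safe_n_complete("not a data frame")

print(ok$result)
print(conditionMessage(bad$error))
```

Each call returns a list with a `result` and an `error` element, so downstream code can decide how to report a failed check.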

About the speaker

Jacqueline Gutman

Jacqueline Gutman is a data scientist on the Quantitative Sciences team at Flatiron Health, supporting cancer research and improving patient care by working with electronic health record data and building accessible data science tools for reproducible research. Prior to joining Flatiron, she worked as a data scientist at Plated and at NYU School of Medicine, developing machine learning pipelines and researching ways to improve educational outcomes in the health professions. She holds a Master’s in Data Science from New York University.