List-columns in data.table: Reducing the cognitive & computational burden of complex data

January 31, 2020

The use of list-columns in data frames and tibbles is well documented (e.g. Bryan, 2018). List-columns provide a cognitively efficient way to organize complex results (e.g. several statistical models, groupings of text, data summaries, or even graphics) alongside the corresponding data. For example, one can store student information within classrooms, player information within teams, or analyses within groups. This allows the nested data to vary in size without complicating the structure of the data or adding redundancy to it. In turn, this can make analyses of the data more reliable. Because data.table is fast and memory efficient, being able to use it to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data.table using the by argument in data.table and purrr::map(). I compare the behavior of the data.table approaches to dplyr::group_nest() and tidyr::unnest(), two of the several powerful Tidyverse nesting and unnesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns.
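As a rough illustration of the nesting idea described above, the following minimal sketch builds a list-column with the by argument and then unnests it again. The example data (classroom, student, score) and the object names are hypothetical and are not taken from the talk. Base R's vapply() stands in here for purrr::map() only to keep dependencies minimal, and this is not necessarily the exact approach the talk presents.

```r
library(data.table)

# Hypothetical example data: students nested within classrooms
dt <- data.table(
  classroom = c("A", "A", "A", "B", "B"),
  student   = c("s1", "s2", "s3", "s4", "s5"),
  score     = c(90, 85, 70, 88, 92)
)

# Nest: one row per classroom. Wrapping .SD in list() stores each group's
# remaining columns as a single data.table inside the list-column `data`,
# instead of expanding them back into rows.
nested <- dt[, .(data = list(.SD)), by = classroom]

# Work on the list-column element by element, much like purrr::map() would.
# vapply() returns atomic vectors, so these become ordinary columns.
nested[, n := vapply(data, nrow, integer(1))]
nested[, mean_score := vapply(data, function(d) mean(d$score), numeric(1))]

# Unnest: each group holds exactly one list element, so data[[1]] pulls
# out that group's data.table and `by` re-attaches the classroom label.
unnested <- nested[, data[[1]], by = classroom]

print(nested)
print(unnested)
```

In this sketch each classroom keeps its own table of a different size, which is the variable-size property the abstract refers to. The roughly comparable Tidyverse calls would be dplyr::group_nest() for the nesting step and tidyr::unnest() for the reverse.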

About the speaker

I am a Research Assistant Professor at Utah State University in clinical and social data science. My work addresses the research data analytic needs of the college, with an emphasis on using R to work with complex data sets. In addition to my own research, I consult on data science issues for the college (including public health, education, psychology, and other social sciences). I also teach statistics and research methods courses. However, my favorite courses to teach are an Intro to R course and an Intermediate R course for undergraduate and graduate students.