Creating and Preprocessing a Design Matrix with Recipes

June 8, 2017 Max Kuhn

R has an excellent framework for specifying models using formulas. While elegant and useful, it was designed at a time when models had small numbers of terms and complex preprocessing of data was not commonplace. As such, it has some limitations. This talk introduces a new package called recipes, in which model terms and preprocessing steps are specified sequentially, one step at a time. Once specified, the recipe can be estimated on one dataset and then applied to any other dataset. Current options include simple transformations (log, Box-Cox, interactions, dummy variables, …), signal extraction (PCA, ICA, MDS), basis functions (splines, polynomials), imputation methods, and others.
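The specify, estimate, apply workflow described in the abstract might look roughly like the sketch below. It is an illustration rather than material from the talk: the mtcars data, the train/test split, and the particular steps are assumptions chosen for brevity, and the function names (prep, bake, step_normalize) follow recent releases of recipes, which may differ from the version shown in 2017.

```r
library(recipes)

# Assumption: an arbitrary split of a built-in dataset, used only so that
# the recipe can be estimated on one part and applied to another.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# The formula only declares roles: mpg is the outcome, the rest are predictors.
rec <- recipe(mpg ~ ., data = train)

# Preprocessing steps are enumerated in the order they should run.
rec <- step_log(rec, disp, hp)                         # simple transformation
rec <- step_normalize(rec, all_predictors())           # center and scale
rec <- step_pca(rec, all_predictors(), num_comp = 3)   # signal extraction

# Estimate every step (means, standard deviations, PCA loadings)
# from the training data only.
prepped <- prep(rec, training = train)

# Apply the estimated recipe to a different dataset.
baked_test <- bake(prepped, new_data = test)
```

Because the statistics are estimated once in prep() and reused in bake(), the test rows are transformed with the training set's means, standard deviations, and loadings rather than their own.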

About the Author

Max Kuhn

Max Kuhn is a software engineer at RStudio, where he is currently working on improving R's modeling capabilities. He was previously a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut, and he applied models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics.

Max is the author of numerous R packages for techniques in machine learning and reproducible research and is an Associate Editor for the Journal of Statistical Software. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel Award from the American Statistical Association, given to the best book reviewed in Technometrics in 2015. Their latest book, Feature Engineering and Selection, was published in 2019.
