Categorical Embeddings: New Ways to Simplify Complex Data

Categorical embeddings are a relatively new technique that borrows methods popularized in Natural Language Processing; they help models handle categorical variables and can also reveal structure in the categories themselves.


January 21, 2021

When building a predictive model in R, many functions (such as lm(), glm(), randomForest(), xgboost, or neural networks in keras) require all input variables to be numeric. If your data has categorical variables, you may be forced to choose between ignoring some of your data and creating too many new columns through one-hot encoding.
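To see the column explosion concretely, here is a minimal sketch in R (the data frame and column names are invented for illustration): one-hot encoding a single high-cardinality variable with base R's model.matrix() produces one new column per category level.

```r
set.seed(1)

# Toy data: one high-cardinality categorical column and a numeric outcome
df <- data.frame(
  zip_code = factor(sprintf("%05d", sample(1:500, 1000, replace = TRUE))),
  income   = rnorm(1000)
)

# One-hot (dummy) encoding: one indicator column per distinct zip code
X <- model.matrix(~ zip_code - 1, data = df)
ncol(X)  # hundreds of mostly-zero columns from a single variable
```

An embedding instead maps each level to a short, dense numeric vector (say, 5 values), keeping the model input compact.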


While there are a number of online tutorials on using Keras (usually in Python) to create these embeddings, this talk uses embed::step_embed(), a recipe step from the embed package, which extends the recipes package, to create the embeddings.
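A minimal sketch of that workflow is below. It assumes the recipes and embed packages are installed (step_embed() trains a small keras model behind the scenes, so a working keras/TensorFlow setup is also required); the data frame, variable names, and num_terms value are illustrative, not the speaker's exact code.

```r
library(recipes)
library(embed)

# df has a categorical predictor zip_code and a numeric outcome income
rec <- recipe(income ~ zip_code, data = df) %>%
  # Replace zip_code with a learned 5-dimensional embedding,
  # trained against the outcome variable
  step_embed(zip_code, outcome = vars(income), num_terms = 5) %>%
  prep()

# The baked data now holds numeric embedding columns instead of zip_code
bake(rec, new_data = NULL)
```

Because the trained embedding is just a numeric table of levels-to-vectors, it can also be inspected directly (e.g., via tidy() on the prepped step) to explore which categories the model treats as similar.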




About the speaker

Alan Feder is a Principal Data Scientist at Invesco, where he uses as much R as possible to solve problems and build products throughout the company. Previously, he worked as a data scientist at AIG and an actuary at Swiss Re. He studied statistics and mathematics at Columbia University. He is unreasonably excited to spread the word about categorical embeddings. Alan lives in New York City with his wife, Ashira, and two children, Matan and Sarit.