I am using partial least squares regression (PLS) to model the relative effects of soil and weather variables on the magnitude of an annual phenomenon, nitrous oxide emissions. I am doing this on an annual basis across many sites.
Soils at each site are different, but soils within each site stay the same every year.
Weather is different at every site and different from year to year.
So I have 56 responses (nitrous oxide) and 56 corresponding rows of weather predictors (one for each site-year), but only 10 unique rows of soil predictors (one for each soil type).
I am using the pls package in R. I think I did an okay job pre-processing my data (Box-Cox transformation, centering, scaling). I end up with 17 columns of soil properties (each site's values repeated for every year, so the rows are not unique), 5 columns of weather variables (each row unique), and my dependent variable, nitrous oxide (each row unique).
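To make the layout concrete, here is a minimal sketch with made-up numbers. The split of 56 site-years into 10 sites with 5 or 6 years each, the column names (soil_1, weather_1, ...) and the random values are all assumptions for illustration, not my real data, and the Box-Cox step is left out.

```r
set.seed(1)

# Hypothetical layout: 10 sites (soil types), 5 or 6 years each, 56 site-years
n_years <- c(6, 6, 6, 6, 6, 6, 5, 5, 5, 5)
site <- rep(seq_along(n_years), times = n_years)

# One row of 17 soil properties per site (made-up values)
soil <- matrix(rnorm(10 * 17), nrow = 10,
               dimnames = list(NULL, paste0("soil_", 1:17)))

# One row of 5 weather variables per site-year (made-up values)
weather <- matrix(rnorm(56 * 5), nrow = 56,
                  dimnames = list(NULL, paste0("weather_", 1:5)))

# Repeat each site's soil row for every year that site was observed
all_data <- data.frame(nitrous_oxide = rnorm(56), soil[site, ], weather)

# Center and scale the predictors only
all_data[, -1] <- scale(all_data[, -1])

dim(all_data)                                      # 56 rows, 23 columns
nrow(unique(all_data[, paste0("soil_", 1:17)]))    # 10 distinct soil rows
nrow(unique(all_data[, paste0("weather_", 1:5)]))  # 56 distinct weather rows
```

The last two lines show the imbalance: the soil block carries only 10 distinct rows, while the weather block carries 56.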
I just need one line of code:
plsFit <- plsr(nitrous_oxide ~ ., validation = "LOO", data = all_data)
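For reference, here is a sketch of that call on toy data, next to one possible variation I am unsure about: cross-validation that leaves out a whole site at a time instead of a single row. The toy data, the site vector, ncomp = 5 and the names site_segments and plsFitSite are assumptions for illustration. The pls functions used (plsr, RMSEP, and the segments argument passed through to the cross-validation) are part of the package, but whether grouping by site is the right approach here is exactly what I am not sure about.

```r
library(pls)

set.seed(1)

# Made-up data with the same shape as mine: 10 sites, 56 site-years
n_years <- c(6, 6, 6, 6, 6, 6, 5, 5, 5, 5)
site <- rep(seq_along(n_years), times = n_years)
soil <- matrix(rnorm(10 * 17), nrow = 10,
               dimnames = list(NULL, paste0("soil_", 1:17)))
weather <- matrix(rnorm(56 * 5), nrow = 56,
                  dimnames = list(NULL, paste0("weather_", 1:5)))
all_data <- data.frame(nitrous_oxide = rnorm(56), soil[site, ], weather)

# The call from my post, with an explicit (arbitrary) cap on components
plsFit <- plsr(nitrous_oxide ~ ., ncomp = 5, validation = "LOO",
               data = all_data)
loo_rmsep <- RMSEP(plsFit, estimate = "CV")

# Variation: each CV segment holds all rows of one site, so a site's
# repeated soil values never sit in training and held-out data together
site_segments <- split(seq_along(site), site)
plsFitSite <- plsr(nitrous_oxide ~ ., ncomp = 5, validation = "CV",
                   segments = site_segments, data = all_data)
site_rmsep <- RMSEP(plsFitSite, estimate = "CV")

print(loo_rmsep)
print(site_rmsep)
```

On real data, a large gap between the two RMSEP curves would suggest that the leave-one-out estimate is being helped by the repeated soil rows.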
Everything runs great, but I can't help feeling that the data are unbalanced: the soil values repeat across years, while the weather values are unique to every row.
Should I really be treating soil and weather variables in the same way?
Applied Predictive Modeling has been extremely helpful in trying to do what I want to do, but boy do I have a lot of questions specific to my dataset.