The charming adventures of an analyst and his solver.
Predict how much people like cats and dogs based on their ice cream preferences. Also, R.
I thought it might be useful to follow up the last post with another one showing the same examples in R.
R provides a function called lm, which is similar in spirit to numpy's linalg.lstsq. As you'll see, lm's interface is a bit more tuned to the concepts of modeling.
We begin by reading in the example CSV into a data frame:
> responses <- read.csv('example_data.csv') > responses respondent vanilla.love strawberry.love chocolate.love 1 Serdar 9 4 9 2 Dan 8 6 4 3 Nathaniel 9 4 8 4 Lauren 3 7 9 5 Jen 6 8 5 6 Jackie 4 5 3 dog.love cat.love 1 9 9 2 10 4 3 2 6 4 4 6 5 2 5 6 10 3
A data frame is sort of like a matrix, but with named columns. That is, we can refer to entire columns using the dollar sign. We are now ready to run least squares. We'll create the model for predicting "dog love." To create the "cat love" model, simply use that column name instead:
> fit1 <- lm(responses$dog.love ~ responses$vanilla.love + responses$strawberry.love + responses$chocolate.love)
The syntax for lm is a little offputting at first. This call tells it to create a model for "dog love" with respect to (the ~) a function of the form offset + x1*vanilla love + x2*strawberry love + x3*chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python. Now that we've computed the coefficients for our "dog love" model, we can ask R about it:
> summary(fit1) Call: lm(formula = responses$dog.love ~ responses$vanilla.love + responses$strawberry.love + responses$chocolate.love) Residuals: 1 2 3 4 5 6 3.1827 2.9436 -4.5820 0.8069 -1.9856 -0.3657 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.9298 15.0654 1.389 0.299 responses$vanilla.love -0.2783 0.9934 -0.280 0.806 responses$strawberry.love -1.4314 1.5905 -0.900 0.463 responses$chocolate.love -0.7647 0.8214 -0.931 0.450 Residual standard error: 4.718 on 2 degrees of freedom Multiple R-squared: 0.4206, Adjusted R-squared: -0.4485 F-statistic: 0.484 on 3 and 2 DF, p-value: 0.7272
This gives us quite a bit of information, including the coefficients for our "dog love" model and various error metrics. You can find the offset and coefficients under the Estimate column above. We quickly verify this using R's vectorized arithmetic:
> 20.9298 - 0.2783 * responses$vanilla.love - 1.4314 * responses$strawberry.love - 0.7647 * responses$chocolate.love  5.8172 7.0562 6.5819 3.1928 3.9853 10.3655
You'll notice the model is essentially the same as the one we got from numpy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function allows us to easily add additional operators to columns. That's how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.
> fit2 <- lm(responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) + responses$strawberry.love + I(responses$strawberry.love^2) + responses$chocolate.love + I(responses$chocolate.love^2)) > summary(fit2) Call: lm(formula = responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) + responses$strawberry.love + I(responses$strawberry.love^2) + responses$chocolate.love + I(responses$chocolate.love^2)) Residuals: ALL 6 residuals are 0: no residual degrees of freedom! Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -357.444 NA NA NA responses$vanilla.love 72.444 NA NA NA I(responses$vanilla.love^2) -6.111 NA NA NA responses$strawberry.love 59.500 NA NA NA I(responses$strawberry.love^2) -5.722 NA NA NA responses$chocolate.love 7.000 NA NA NA I(responses$chocolate.love^2) NA NA NA NA Residual standard error: NaN on 0 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 5 and 0 DF, p-value: NA
We can see that we get the same "dog love" model as produced by the third Python version of the last post. Again, we quickly verify that the output is the same (minus some rounding errors):
> -357.444 + 72.444 * responses$vanilla.love - 6.111 * responses$vanilla.love^2 + 59.5 * responses$strawberry.love - 5.722 * responses$strawberry.love^2 + 7 * responses$chocolate.love  9.009 10.012 2.009 4.011 2.016 10.006