Note: This post was updated to include an example data file.
I thought it might be useful to follow up the last post with another one showing the same examples in R. R provides a function called lm, which is similar in spirit to NumPy’s linalg.lstsq. As you’ll see, lm’s interface is a bit more tuned to the concepts of modeling.
We begin by reading the example CSV into a data frame:
responses <- read.csv('example_data.csv')
responses
  respondent vanilla.love strawberry.love chocolate.love dog.love cat.love
1     Aylssa            9               4              9        9        9
2        Ben            8               6              4       10        4
3         Cy            9               4              8        2        6
4        Eva            3               7              9        4        6
5        Lem            6               8              5        2        5
6      Louis            4               5              3       10        3
A data frame is sort of like a matrix, but with named columns. That means we can refer to an entire column using the dollar sign, as in responses$dog.love. We are now ready to run least squares. We’ll create the model for predicting “dog love”; to create the “cat love” model, simply use that column name instead:
fit1 <- lm(
    responses$dog.love ~ responses$vanilla.love +
    responses$strawberry.love +
    responses$chocolate.love
)
The syntax for lm is a little off-putting at first. This call tells it to create a model for “dog love” with respect to (the ~) a function of the form offset + x1 * vanilla love + x2 * strawberry love + x3 * chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python.
Now that we’ve computed the coefficients for our “dog love” model, we can ask R about it:
summary(fit1)
Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + responses$strawberry.love +
    responses$chocolate.love)

Residuals:
      1       2       3       4       5       6
 3.1827  2.9436 -4.5820  0.8069 -1.9856 -0.3657

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                20.9298    15.0654   1.389    0.299
responses$vanilla.love     -0.2783     0.9934  -0.280    0.806
responses$strawberry.love  -1.4314     1.5905  -0.900    0.463
responses$chocolate.love   -0.7647     0.8214  -0.931    0.450

Residual standard error: 4.718 on 2 degrees of freedom
Multiple R-squared: 0.4206,  Adjusted R-squared: -0.4485
F-statistic: 0.484 on 3 and 2 DF,  p-value: 0.7272
This gives us quite a bit of information, including the coefficients for our “dog love” model and various error metrics. You can find the offset and coefficients under the Estimate column above. We can quickly verify that they reproduce the predictions we got in Python by using R’s vectorized arithmetic:
20.9298 -
0.2783 * responses$vanilla.love -
1.4314 * responses$strawberry.love -
0.7647 * responses$chocolate.love
[1] 5.8172 7.0562 6.5819 3.1928 3.9853 10.3655
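Incidentally, we don’t need to retype the rounded coefficients to do this. As a small aside not covered in the last post, base R provides accessor functions for fitted models, and the following should agree with the hand-computed vector above (up to the rounding of the printed coefficients):

coef(fit1)    # the offset and coefficients, as a named vector
fitted(fit1)  # the model's prediction for each respondent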
You’ll notice the model is essentially the same as the one we got from NumPy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function lets us apply arithmetic operators to columns inside a formula, which is how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.
fit2 <- lm(responses$dog.love ~ responses$vanilla.love +
    I(responses$vanilla.love^2) + responses$strawberry.love +
    I(responses$strawberry.love^2) + responses$chocolate.love +
    I(responses$chocolate.love^2))
summary(fit2)
Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) +
    responses$strawberry.love + I(responses$strawberry.love^2) +
    responses$chocolate.love + I(responses$chocolate.love^2))

Residuals:
ALL 6 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    -357.444        NaN     NaN      NaN
responses$vanilla.love           72.444        NaN     NaN      NaN
I(responses$vanilla.love^2)      -6.111        NaN     NaN      NaN
responses$strawberry.love        59.500        NaN     NaN      NaN
I(responses$strawberry.love^2)   -5.722        NaN     NaN      NaN
responses$chocolate.love          7.000        NaN     NaN      NaN
I(responses$chocolate.love^2)        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1,  Adjusted R-squared: NaN
F-statistic: NaN on 5 and 0 DF,  p-value: NA
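The “1 not defined because of singularities” note is worth a word: with an offset plus six inputs, the design matrix has seven columns but only six rows, so the columns cannot all be linearly independent, and R drops the aliased term (the chocolate squared column, reported as NA above). As a quick check, not part of the original post, we can look at the dimensions of that design matrix:

dim(model.matrix(fit2))  # observations by model columns
[1] 6 7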
We can see that we get the same “dog love” model as the one produced by the third Python version in the last post. Again, we quickly verify that the output is the same (minus some rounding error):
-357.444 +
72.444 * responses$vanilla.love -
6.111 * responses$vanilla.love^2 +
59.5 * responses$strawberry.love -
5.722 * responses$strawberry.love^2 +
7 * responses$chocolate.love
[1] 9.009 10.012 2.009 4.011 2.016 10.006
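Since the residuals are all zero, these predictions match the original “dog love” column almost exactly. Finally, as noted earlier, building the “cat love” model only requires swapping in that column as the response. A minimal sketch, reusing the first formula (the name fit_cat is my own, not from the last post):

fit_cat <- lm(
    responses$cat.love ~ responses$vanilla.love +
    responses$strawberry.love +
    responses$chocolate.love
)
summary(fit_cat)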