Note: This post was updated to include an example data file.
I thought it might be useful to follow up the last post with another one showing the same examples in R. R provides a function called lm, which is similar in spirit to NumPy’s linalg.lstsq. As you’ll see, lm’s interface is a bit more tuned to the concepts of modeling.
We begin by reading the example CSV into a data frame:
responses <- read.csv('example_data.csv')
responses
  respondent vanilla.love strawberry.love chocolate.love dog.love cat.love
1     Aylssa            9               4              9        9        9
2        Ben            8               6              4       10        4
3         Cy            9               4              8        2        6
4        Eva            3               7              9        4        6
5        Lem            6               8              5        2        5
6      Louis            4               5              3       10        3
A data frame is sort of like a matrix, but with named columns. That means we can refer to an entire column using the dollar sign, as in responses$dog.love. We are now ready to run least squares. We’ll create the model for predicting “dog love”; to create the “cat love” model, simply use that column name instead:
fit1 <- lm(
    responses$dog.love ~ responses$vanilla.love +
    responses$strawberry.love +
    responses$chocolate.love
)
The syntax for lm is a little off-putting at first. This call tells it to create a model for “dog love” with respect to (the ~) a function of the form offset + x1 * vanilla love + x2 * strawberry love + x3 * chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python.
Now that we’ve computed the coefficients for our “dog love” model, we can ask R about it:
summary(fit1)
Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + responses$strawberry.love +
    responses$chocolate.love)

Residuals:
      1       2       3       4       5       6
 3.1827  2.9436 -4.5820  0.8069 -1.9856 -0.3657

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                20.9298    15.0654   1.389    0.299
responses$vanilla.love     -0.2783     0.9934  -0.280    0.806
responses$strawberry.love  -1.4314     1.5905  -0.900    0.463
responses$chocolate.love   -0.7647     0.8214  -0.931    0.450

Residual standard error: 4.718 on 2 degrees of freedom
Multiple R-squared: 0.4206,  Adjusted R-squared: -0.4485
F-statistic: 0.484 on 3 and 2 DF,  p-value: 0.7272
This gives us quite a bit of information, including the coefficients for our “dog love” model and various error metrics. You can find the offset and coefficients under the Estimate column above. We can quickly verify that they reproduce the predictions we got in Python by using R’s vectorized arithmetic:
20.9298 -
0.2783 * responses$vanilla.love -
1.4314 * responses$strawberry.love -
0.7647 * responses$chocolate.love
[1] 5.8172 7.0562 6.5819 3.1928 3.9853 10.3655
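Incidentally, we don’t need to retype the rounded coefficients to do this. As a small aside not covered in the last post, base R provides accessor functions for fitted models, and the following should agree with the hand-computed vector above (up to the rounding of the printed coefficients):

coef(fit1)    # the offset and coefficients, as a named vector
fitted(fit1)  # the model's prediction for each respondent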
You’ll notice the model is essentially the same as the one we got from NumPy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function lets us apply arithmetic operators to columns inside a formula, which is how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.
fit2 <- lm(responses$dog.love ~ responses$vanilla.love +
    I(responses$vanilla.love^2) + responses$strawberry.love +
    I(responses$strawberry.love^2) + responses$chocolate.love +
    I(responses$chocolate.love^2))
summary(fit2)
Call:
lm(formula = responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) +
    responses$strawberry.love + I(responses$strawberry.love^2) +
    responses$chocolate.love + I(responses$chocolate.love^2))

Residuals:
ALL 6 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    -357.444        NaN     NaN      NaN
responses$vanilla.love           72.444        NaN     NaN      NaN
I(responses$vanilla.love^2)      -6.111        NaN     NaN      NaN
responses$strawberry.love        59.500        NaN     NaN      NaN
I(responses$strawberry.love^2)   -5.722        NaN     NaN      NaN
responses$chocolate.love          7.000        NaN     NaN      NaN
I(responses$chocolate.love^2)        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1,  Adjusted R-squared: NaN
F-statistic: NaN on 5 and 0 DF,  p-value: NA
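The “1 not defined because of singularities” note is worth a word: with an offset plus six inputs, the design matrix has seven columns but only six rows, so the columns cannot all be linearly independent, and R drops the aliased term (the chocolate squared column, reported as NA above). As a quick check, not part of the original post, we can look at the dimensions of that design matrix:

dim(model.matrix(fit2))  # observations by model columns
[1] 6 7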
We can see that we get the same “dog love” model as the one produced by the third Python version in the last post. Again, we quickly verify that the output is the same (minus some rounding error):
-357.444 +
72.444 * responses$vanilla.love -
6.111 * responses$vanilla.love^2 +
59.5 * responses$strawberry.love -
5.722 * responses$strawberry.love^2 +
7 * responses$chocolate.love
[1] 9.009 10.012 2.009 4.011 2.016 10.006
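Since the residuals are all zero, these predictions match the original “dog love” column almost exactly. Finally, as noted earlier, building the “cat love” model only requires swapping in that column as the response. A minimal sketch, reusing the first formula (the name fit_cat is my own, not from the last post):

fit_cat <- lm(
    responses$cat.love ~ responses$vanilla.love +
    responses$strawberry.love +
    responses$chocolate.love
)
summary(fit_cat)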