adventures in optimization

The charming adventures of an analyst and his solver.

« Wednesday, February 16, 2011 »

Data Fitting Part 2a - Very, Very Simple Linear Regression in R

Predict how much people like cats and dogs based on their ice cream preferences. Also, R.

I thought it might be useful to follow up the last post with another one showing the same examples in R.

R provides a function called lm, which is similar in spirit to numpy's linalg.lstsq. As you'll see, lm's interface is a bit more tuned to the concepts of modeling.

We begin by reading in the example CSV into a data frame:

> responses <- read.csv('example_data.csv')

> responses
  respondent vanilla.love strawberry.love chocolate.love
1     Serdar            9               4              9
2        Dan            8               6              4
3  Nathaniel            9               4              8
4     Lauren            3               7              9
5        Jen            6               8              5
6     Jackie            4               5              3

  dog.love cat.love
1        9        9
2       10        4
3        2        6
4        4        6
5        2        5
6       10        3

A data frame is sort of like a matrix, but with named columns. That is, we can refer to entire columns using the dollar sign. We are now ready to run least squares. We'll create the model for predicting "dog love." To create the "cat love" model, simply use that column name instead:

> fit1 <- lm(responses$dog.love ~ responses$vanilla.love
    + responses$strawberry.love + responses$chocolate.love)

The syntax for lm is a little offputting at first. This call tells it to create a model for "dog love" with respect to (the ~) a function of the form offset + x1*vanilla love + x2*strawberry love + x3*chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python. Now that we've computed the coefficients for our "dog love" model, we can ask R about it:

> summary(fit1)

Call:
lm(formula = responses$dog.love ~ responses$vanilla.love
  + responses$strawberry.love + responses$chocolate.love)

Residuals:
      1       2       3       4       5       6
 3.1827  2.9436 -4.5820  0.8069 -1.9856 -0.3657

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                20.9298    15.0654   1.389    0.299
responses$vanilla.love     -0.2783     0.9934  -0.280    0.806
responses$strawberry.love  -1.4314     1.5905  -0.900    0.463
responses$chocolate.love   -0.7647     0.8214  -0.931    0.450

Residual standard error: 4.718 on 2 degrees of freedom
Multiple R-squared: 0.4206,     Adjusted R-squared: -0.4485
F-statistic: 0.484 on 3 and 2 DF,  p-value: 0.7272

This gives us quite a bit of information, including the coefficients for our "dog love" model and various error metrics. You can find the offset and coefficients under the Estimate column above. We quickly verify this using R's vectorized arithmetic:

> 20.9298 - 0.2783 * responses$vanilla.love
    - 1.4314 * responses$strawberry.love
    - 0.7647 * responses$chocolate.love

[1]  5.8172  7.0562  6.5819  3.1928  3.9853 10.3655

You'll notice the model is essentially the same as the one we got from numpy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function allows us to easily add additional operators to columns. That's how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.

> fit2 <- lm(responses$dog.love ~ responses$vanilla.love
    + I(responses$vanilla.love^2) + responses$strawberry.love
    + I(responses$strawberry.love^2) + responses$chocolate.love
    + I(responses$chocolate.love^2))

> summary(fit2)

Call:
lm(formula = responses$dog.love ~ responses$vanilla.love
  + I(responses$vanilla.love^2) + responses$strawberry.love
  + I(responses$strawberry.love^2) + responses$chocolate.love
  + I(responses$chocolate.love^2))

Residuals:
ALL 6 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    -357.444         NA      NA       NA
responses$vanilla.love           72.444         NA      NA       NA
I(responses$vanilla.love^2)      -6.111         NA      NA       NA
responses$strawberry.love        59.500         NA      NA       NA
I(responses$strawberry.love^2)   -5.722         NA      NA       NA
responses$chocolate.love          7.000         NA      NA       NA
I(responses$chocolate.love^2)        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:     1,      Adjusted R-squared:   NaN
F-statistic:   NaN on 5 and 0 DF,  p-value: NA

We can see that we get the same "dog love" model as produced by the third Python version of the last post. Again, we quickly verify that the output is the same (minus some rounding errors):

> -357.444 + 72.444 * responses$vanilla.love
    - 6.111 * responses$vanilla.love^2
    + 59.5 * responses$strawberry.love
    - 5.722 * responses$strawberry.love^2
    + 7 * responses$chocolate.love

[1]  9.009 10.012  2.009  4.011  2.016 10.006