The charming adventures of an analyst and his solver.

« Wednesday, February 16, 2011 »

## Data Fitting Part 2a - Very, Very Simple Linear Regression in R

Predict how much people like cats and dogs based on their ice cream preferences. Also, R.

I thought it might be useful to follow up the last post with another one showing the same examples in R.

R provides a function called lm, which is similar in spirit to numpy's linalg.lstsq. As you'll see, lm's interface is a bit more tuned to the concepts of modeling.

We begin by reading in the example CSV into a data frame:

> responses <- read.csv('example_data.csv')

> responses
respondent vanilla.love strawberry.love chocolate.love
1     Serdar            9               4              9
2        Dan            8               6              4
3  Nathaniel            9               4              8
4     Lauren            3               7              9
5        Jen            6               8              5
6     Jackie            4               5              3

dog.love cat.love
1        9        9
2       10        4
3        2        6
4        4        6
5        2        5
6       10        3


A data frame is sort of like a matrix, but with named columns. That is, we can refer to entire columns using the dollar sign. We are now ready to run least squares. We'll create the model for predicting "dog love." To create the "cat love" model, simply use that column name instead:

> fit1 <- lm(responses$dog.love ~ responses$vanilla.love
+ responses$strawberry.love + responses$chocolate.love)


The syntax for lm is a little offputting at first. This call tells it to create a model for "dog love" with respect to (the ~) a function of the form offset + x1*vanilla love + x2*strawberry love + x3*chocolate love. Note that the offset is conveniently implied when using lm, so this is the same as the second model we created in Python. Now that we've computed the coefficients for our "dog love" model, we can ask R about it:

> summary(fit1)

Call:
lm(formula = responses$dog.love ~ responses$vanilla.love
+ responses$strawberry.love + responses$chocolate.love)

Residuals:
1       2       3       4       5       6
3.1827  2.9436 -4.5820  0.8069 -1.9856 -0.3657

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                20.9298    15.0654   1.389    0.299
responses$vanilla.love -0.2783 0.9934 -0.280 0.806 responses$strawberry.love  -1.4314     1.5905  -0.900    0.463
responses$chocolate.love -0.7647 0.8214 -0.931 0.450 Residual standard error: 4.718 on 2 degrees of freedom Multiple R-squared: 0.4206, Adjusted R-squared: -0.4485 F-statistic: 0.484 on 3 and 2 DF, p-value: 0.7272  This gives us quite a bit of information, including the coefficients for our "dog love" model and various error metrics. You can find the offset and coefficients under the Estimate column above. We quickly verify this using R's vectorized arithmetic: > 20.9298 - 0.2783 * responses$vanilla.love
- 1.4314 * responses$strawberry.love - 0.7647 * responses$chocolate.love

[1]  5.8172  7.0562  6.5819  3.1928  3.9853 10.3655


You'll notice the model is essentially the same as the one we got from numpy. Our next step is to add in the squared inputs. We do this by adding extra terms to the modeling formula. The I() function allows us to easily add additional operators to columns. That's how we accomplish the squaring. We could alternatively add squared input values to the data frame, but using I() is more convenient and natural.

> fit2 <- lm(responses$dog.love ~ responses$vanilla.love
+ I(responses$vanilla.love^2) + responses$strawberry.love
+ I(responses$strawberry.love^2) + responses$chocolate.love
+ I(responses$chocolate.love^2)) > summary(fit2) Call: lm(formula = responses$dog.love ~ responses$vanilla.love + I(responses$vanilla.love^2) + responses$strawberry.love + I(responses$strawberry.love^2) + responses$chocolate.love + I(responses$chocolate.love^2))

Residuals:
ALL 6 residuals are 0: no residual degrees of freedom!

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept)                    -357.444         NA      NA       NA
responses$vanilla.love 72.444 NA NA NA I(responses$vanilla.love^2)      -6.111         NA      NA       NA
responses$strawberry.love 59.500 NA NA NA I(responses$strawberry.love^2)   -5.722         NA      NA       NA
responses$chocolate.love 7.000 NA NA NA I(responses$chocolate.love^2)        NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:     1,      Adjusted R-squared:   NaN
F-statistic:   NaN on 5 and 0 DF,  p-value: NA


We can see that we get the same "dog love" model as produced by the third Python version of the last post. Again, we quickly verify that the output is the same (minus some rounding errors):

> -357.444 + 72.444 * responses$vanilla.love - 6.111 * responses$vanilla.love^2
+ 59.5 * responses$strawberry.love - 5.722 * responses$strawberry.love^2
+ 7 * responses\$chocolate.love

[1]  9.009 10.012  2.009  4.011  2.016 10.006