The charming adventures of an analyst and his solver.

« Tuesday, February 15, 2011 »

## Data Fitting Part 2 - Very, Very Simple Linear Regression in Python

Predict how much people like cats and dogs based on their ice cream preferences. Also, Python and numpy.

This post is based on a memo I sent to some former colleagues at the Post. I've edited it for use here since it fits well as the second in a series on simple data fitting techniques. If you're among the many enlightened individuals already using regression analysis, then this post is probably not for you. If you aren't, then hopefully this provides everything you need to develop rudimentary predictive models that yield surprising levels of accuracy.

For purposes of a simple working example, we have collected six records of input data over three dimensions with the goal of predicting two outputs. The input data are:

    x1: How much a respondent likes vanilla [0-10]
x2: How much a respondent likes strawberry [0-10]
x3: How much a respondent likes chocolate [0-10]


Output data consist of:

    b1: How much a respondent likes dogs [0-10]
b2: How much a respondent likes cats [0-10]


Below are entirely anonymous data collected from people who bear no resemblance to certain Washington Post staffers.

respondentvanilla lovestrawberry lovechocolate lovedog lovecat love
Serdar94999
Dan864104
Nathaniel94826
Lauren37946
Jen68525
Jackie453103

Our input is in three dimensions. Each output requires its own model, so we'll have one for dogs and one for cats. We're looking for functions, dog(x) and cat(x), that can predict b1 and b2 based on given values of x1, x2, and x3.

For both models we want to find parameters that minimize their squared residuals (read: errors). There's a number of names for this. Optimization folks like to think of it as unconstrained quadratic optimization, but it's more common to call it least squares or linear regression. It's not necessary to entirely understand why for our purposes, but the function that minimizes these errors is:

$$\beta = ({A^t}A)^{-1}{A^t}b$$

This is implemented for you in the numpy.linalg Python package, which we'll use for examples. Much more information than you probably want can be found here.

Below is a first stab at a Python version. It runs least squares against our input and output data exactly as they are. You can see the matrix A and outputs b1 and b2 (dog and cat love, respectively) are represented just as they are in the table.

# Version 1: No offset, no squared inputs

import numpy

A = numpy.vstack([
[9, 4, 9],
[8, 6, 4],
[9, 4, 8],
[3, 7, 9],
[6, 8, 5],
[4, 5, 3]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print 'dog <3:', numpy.linalg.lstsq(A, b1)
print 'cat <3:', numpy.linalg.lstsq(A, b2)

# Output:
# dog <3: [0.72548294      0.53045642     -0.29952361]
# cat <3: [2.36110929e-01  2.61934385e-05  6.26892476e-01]


The resulting model is:

dog(x) = 0.72548294 * x1 + 0.53045642 * x2 - 0.29952361 * x3
cat(x) = 2.36110929e-01 * x1 + 2.61934385e-05 * x2 + 6.26892476e-01 * x3


The coefficients before our variables correspond to beta in the formula above. Errors between observed and predicted data, shown below, are calculated and summed. For these six records, dog(x) has a total error of 20.76 and cat(x) has 3.74. Not great.

Personpredicted b1b1 errorpredicted b2b2 error
Serdar5.963.047.771.23
Dan7.792.214.400.40
Nathaniel6.254.257.141.14
Lauren3.190.816.350.35
Jen7.105.104.550.45
Jackie4.665.342.830.17
Total error:20.763.74

One problem with this model is that dog(x) and cat(x) are forced to pass through the origin. (Why is that?) We can improve it somewhat if we add an offset. This amounts to prepending 1 to every row in A and adding a constant to the resulting functions. You can see the very slight difference between the code for this model and that of the previous:

# Version 2: Offset, no squared inputs

import numpy

A = numpy.vstack([
[1, 9, 4, 9],
[1, 8, 6, 4],
[1, 9, 4, 8],
[1, 3, 7, 9],
[1, 6, 8, 5],
[1, 4, 5, 3]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print 'dog <3:', numpy.linalg.lstsq(A, b1)
print 'cat <3:', numpy.linalg.lstsq(A, b2)

# Output:
# dog <3: [20.92975427  -0.27831197  -1.43135684  -0.76469017]
# cat <3: [-0.31744124   0.25133547   0.02978098   0.63394765]


This yields the seconds version of our models:

dog(x) = 20.92975427 - 0.27831197 * x1 - 1.43135684 * x2 - 0.76469017 * x3
cat(x) = -0.31744124 + 0.25133547 * x1 + 0.02978098 * x2 + 0.63394765 * x3


These models provide errors of 13.87 and 3.79. A little better on the dog side, but still not quite usable.

Personpredicted b1b1 errorpredicted b2b2 error
Serdar5.823.187.771.23
Dan7.062.944.410.41
Nathaniel6.584.587.141.14
Lauren3.190.816.350.35
Jen3.991.994.600.40
Jackie10.370.372.740.26
Total error:13.873.79

The problem is that dog(x) and cat(x) are linear functions. Most observed data don't conform to straight lines. Take a moment and draw the line $f(x) = x$ and the curve $f(x) = x^2$. The former makes a poor approximation of the latter.

Most of the time, people just use squares of the input data to add curvature to their models. We do this in our next version of the code by just adding squares of the input row values to our A matrix. Everything else is the same. (In reality, you can add any function of the input data you feel best models the data, if you understand it well enough.)

# Version 3: Offset with squared inputs

import numpy

A = numpy.vstack([
[1, 9, 9**2, 4, 4**2, 9, 9**2],
[1, 8, 8**2, 6, 6**2, 4, 4**2],
[1, 9, 9**2, 4, 4**2, 8, 8**2],
[1, 3, 3**2, 7, 7**2, 9, 9**2],
[1, 6, 6**2, 8, 8**2, 5, 5**2],
[1, 4, 4**2, 5, 5**2, 3, 3**2]
])

b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])

print 'dog <3:', numpy.linalg.lstsq(A, b1)
print 'cat <3:', numpy.linalg.lstsq(A, b2)

# dog <3: [1.29368307  7.03633306  -0.44795498  9.98093332
#  -0.75689575  -19.00757486  1.52985734]
# cat <3: [0.47945896  5.30866067  -0.39644128 -1.28704188
#   0.12634295   -4.32392606  0.43081918]


This gives us our final version of the model:

dog(x) = 1.29368307 + 7.03633306 * x1 - 0.44795498 * x1**2 + 9.98093332 * x2 - 0.75689575 * x2**2 - 19.00757486 * x3 + 1.52985734 * x3**2
cat(x) = 0.47945896 + 5.30866067 * x1 - 0.39644128 * x1**2 - 1.28704188 * x2 + 0.12634295 * x2**2 - 4.32392606 * x3 + 0.43081918 * x3**2


Adding curvature to our model eliminates all perceived error, at least within 1e-16. This may seem unbelievable, but when you consider that we only have six input records, it isn't really.

Personpredicted b1b1 errorpredicted b2b2 error
Serdar9090
Dan10040
Nathaniel2060
Lauren4060
Jen2050
Jackie10030
Total error:00

It should be fairly obvious how one can take this and extrapolate to much larger models. I hope this is useful and that least squares becomes an important part of your lives.