The charming adventures of an analyst and his solver.

*Predict how much people like cats and dogs based on their ice cream preferences. Also, Python and numpy.*

This post is based on a memo I sent to some former colleagues at the Post. I've edited it for use here since it fits well as the second in a series on simple data fitting techniques. If you're among the many enlightened individuals already using regression analysis, then this post is probably not for you. If you aren't, then hopefully this provides everything you need to develop rudimentary predictive models that yield surprising levels of accuracy.

For purposes of a simple working example, we have collected six records of input data over three dimensions with the goal of predicting two outputs. The input data are:

x1: How much a respondent likes vanilla [0-10] x2: How much a respondent likes strawberry [0-10] x3: How much a respondent likes chocolate [0-10]

Output data consist of:

b1: How much a respondent likes dogs [0-10] b2: How much a respondent likes cats [0-10]

Below are entirely anonymous data collected from people who bear no resemblance to certain Washington Post staffers.

respondent | vanilla love | strawberry love | chocolate love | dog love | cat love |
---|---|---|---|---|---|

Serdar | 9 | 4 | 9 | 9 | 9 |

Dan | 8 | 6 | 4 | 10 | 4 |

Nathaniel | 9 | 4 | 8 | 2 | 6 |

Lauren | 3 | 7 | 9 | 4 | 6 |

Jen | 6 | 8 | 5 | 2 | 5 |

Jackie | 4 | 5 | 3 | 10 | 3 |

Our input is in three dimensions. Each output requires its own model, so we'll have one for dogs and one for cats. We're looking for functions, dog(x) and cat(x), that can predict b1 and b2 based on given values of x1, x2, and x3.

For both models we want to find parameters that minimize their squared residuals (read: errors). There's a number of names for this. Optimization folks like to think of it as unconstrained quadratic optimization, but it's more common to call it least squares or linear regression. It's not necessary to entirely understand why for our purposes, but the function that minimizes these errors is:

$$\beta = ({A^t}A)^{-1}{A^t}b$$

This is implemented for you in the numpy.linalg Python package, which we'll use for examples. Much more information than you probably want can be found here.

Below is a first stab at a Python version. It runs least squares against our input and output data exactly as they are. You can see the matrix A and outputs b1 and b2 (dog and cat love, respectively) are represented just as they are in the table.

```
# Version 1: No offset, no squared inputs
import numpy
A = numpy.vstack([
[9, 4, 9],
[8, 6, 4],
[9, 4, 8],
[3, 7, 9],
[6, 8, 5],
[4, 5, 3]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])
print 'dog <3:', numpy.linalg.lstsq(A, b1)[0]
print 'cat <3:', numpy.linalg.lstsq(A, b2)[0]
# Output:
# dog <3: [0.72548294 0.53045642 -0.29952361]
# cat <3: [2.36110929e-01 2.61934385e-05 6.26892476e-01]
```

The resulting model is:

dog(x) = 0.72548294 * x1 + 0.53045642 * x2 - 0.29952361 * x3 cat(x) = 2.36110929e-01 * x1 + 2.61934385e-05 * x2 + 6.26892476e-01 * x3

The coefficients before our variables correspond to beta in the formula above. Errors between observed and predicted data, shown below, are calculated and summed. For these six records, dog(x) has a total error of 20.76 and cat(x) has 3.74. Not great.

Person | predicted b1 | b1 error | predicted b2 | b2 error |
---|---|---|---|---|

Serdar | 5.96 | 3.04 | 7.77 | 1.23 |

Dan | 7.79 | 2.21 | 4.40 | 0.40 |

Nathaniel | 6.25 | 4.25 | 7.14 | 1.14 |

Lauren | 3.19 | 0.81 | 6.35 | 0.35 |

Jen | 7.10 | 5.10 | 4.55 | 0.45 |

Jackie | 4.66 | 5.34 | 2.83 | 0.17 |

Total error: | 20.76 | 3.74 |

One problem with this model is that dog(x) and cat(x) are forced to pass through the origin. *(Why is that?)* We can improve it somewhat if we add an offset. This amounts to prepending 1 to every row in A and adding a constant to the resulting functions. You can see the very slight difference between the code for this model and that of the previous:

```
# Version 2: Offset, no squared inputs
import numpy
A = numpy.vstack([
[1, 9, 4, 9],
[1, 8, 6, 4],
[1, 9, 4, 8],
[1, 3, 7, 9],
[1, 6, 8, 5],
[1, 4, 5, 3]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])
print 'dog <3:', numpy.linalg.lstsq(A, b1)[0]
print 'cat <3:', numpy.linalg.lstsq(A, b2)[0]
# Output:
# dog <3: [20.92975427 -0.27831197 -1.43135684 -0.76469017]
# cat <3: [-0.31744124 0.25133547 0.02978098 0.63394765]
```

This yields the seconds version of our models:

dog(x) = 20.92975427 - 0.27831197 * x1 - 1.43135684 * x2 - 0.76469017 * x3 cat(x) = -0.31744124 + 0.25133547 * x1 + 0.02978098 * x2 + 0.63394765 * x3

These models provide errors of 13.87 and 3.79. A little better on the dog side, but still not quite usable.

Person | predicted b1 | b1 error | predicted b2 | b2 error |
---|---|---|---|---|

Serdar | 5.82 | 3.18 | 7.77 | 1.23 |

Dan | 7.06 | 2.94 | 4.41 | 0.41 |

Nathaniel | 6.58 | 4.58 | 7.14 | 1.14 |

Lauren | 3.19 | 0.81 | 6.35 | 0.35 |

Jen | 3.99 | 1.99 | 4.60 | 0.40 |

Jackie | 10.37 | 0.37 | 2.74 | 0.26 |

Total error: | 13.87 | 3.79 |

The problem is that dog(x) and cat(x) are linear functions. Most observed data don't conform to straight lines. Take a moment and draw the line $f(x) = x$ and the curve $f(x) = x^2$. The former makes a poor approximation of the latter.

Most of the time, people just use squares of the input data to add curvature to their models. We do this in our next version of the code by just adding squares of the input row values to our A matrix. Everything else is the same. (In reality, you can add any function of the input data you feel best models the data, if you understand it well enough.)

```
# Version 3: Offset with squared inputs
import numpy
A = numpy.vstack([
[1, 9, 9**2, 4, 4**2, 9, 9**2],
[1, 8, 8**2, 6, 6**2, 4, 4**2],
[1, 9, 9**2, 4, 4**2, 8, 8**2],
[1, 3, 3**2, 7, 7**2, 9, 9**2],
[1, 6, 6**2, 8, 8**2, 5, 5**2],
[1, 4, 4**2, 5, 5**2, 3, 3**2]
])
b1 = numpy.array([9, 10, 2, 4, 2, 10])
b2 = numpy.array([9, 4, 6, 6, 5, 3])
print 'dog <3:', numpy.linalg.lstsq(A, b1)[0]
print 'cat <3:', numpy.linalg.lstsq(A, b2)[0]
# dog <3: [1.29368307 7.03633306 -0.44795498 9.98093332
# -0.75689575 -19.00757486 1.52985734]
# cat <3: [0.47945896 5.30866067 -0.39644128 -1.28704188
# 0.12634295 -4.32392606 0.43081918]
```

This gives us our final version of the model:

dog(x) = 1.29368307 + 7.03633306 * x1 - 0.44795498 * x1**2 + 9.98093332 * x2 - 0.75689575 * x2**2 - 19.00757486 * x3 + 1.52985734 * x3**2 cat(x) = 0.47945896 + 5.30866067 * x1 - 0.39644128 * x1**2 - 1.28704188 * x2 + 0.12634295 * x2**2 - 4.32392606 * x3 + 0.43081918 * x3**2

Adding curvature to our model eliminates all perceived error, at least within 1e-16. This may seem unbelievable, but when you consider that we only have six input records, it isn't really.

Person | predicted b1 | b1 error | predicted b2 | b2 error |
---|---|---|---|---|

Serdar | 9 | 0 | 9 | 0 |

Dan | 10 | 0 | 4 | 0 |

Nathaniel | 2 | 0 | 6 | 0 |

Lauren | 4 | 0 | 6 | 0 |

Jen | 2 | 0 | 5 | 0 |

Jackie | 10 | 0 | 3 | 0 |

Total error: | 0 | 0 |

It should be fairly obvious how one can take this and extrapolate to much larger models. I hope this is useful and that least squares becomes an important part of your lives.