Note: This post was originally written using Julia v0.2, GLPK, and Hedonometer data through 2014. It has been updated to use Julia v1.11, HiGHS, and data through May 26, 2025.
Hedonometer popped onto my radar a couple weeks ago. It’s a nifty project, attempting to convert samples of words found in the Twitter Gardenhose feed into a time series of happiness.
While I’m not a computational social scientist, I must say the data does have a nice intuitive quality to it. There are obvious trends in happiness associated with major holidays, days of the week, and seasons. It seems like the sort of data that could be decomposed into trends based on those various components. The Hedonometer group has, of course, done extensive analyses of their own data which you can find on their papers page.
This post examines another approach. It follows the structure of Robert Vanderbei’s excellent “Local Warming” project to separate out the Hedonometer averages into daily, seasonal, solar, and day-of-the-week trends. We’ll be using Julia with JuMP and HiGHS for linear optimization, Gadfly for graphing, and a few other libraries. If you haven’t installed Julia, first do that. When you import a package that isn’t installed yet, the Julia REPL will offer to install it for you.
Data
Hedonometer provides an API which we can use to pull daily happiness data in JSON format. We can request specific date ranges, or leave the dates off to retrieve the full data set.
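For example, a request limited to a specific date range might use a URL like the one below. The date__gte and date__lte filter names here are an assumption based on the API's Tastypie-style conventions, so check the API documentation for the exact parameter names.
# Hypothetical date-limited query; the filter parameter names are assumptions.
ranged_url = "https://hedonometer.org/api/v1/happiness/?format=json" *
    "&timeseries__title=en_all&date__gte=2020-01-01&date__lte=2020-12-31"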
We use the HTTP, JSON3, and DataFrames packages to read the full Hedonometer data set into a data frame. Calls to parse convert strings to date and float types. Finally, we sort the data frame in place by ascending date.
using DataFrames, Dates, HTTP, JSON3

url = "https://hedonometer.org/api/v1/happiness/?format=json&timeseries__title=en_all"
response = HTTP.get(url)
df = DataFrame(JSON3.read(response.body)[:objects])
df.date = parse.(Date, df.date)
df.happiness = parse.(Float64, df.happiness)
sort!(df, :date)
5367×4 DataFrame
  Row │ date        frequency  happiness  timeseries
      │ Date        Int64      Float64    String
──────┼──────────────────────────────────────────────────────
    1 │ 2008-09-09    2009276      6.042  /api/v1/timeseries/3/
    2 │ 2008-09-10    5263723      6.028  /api/v1/timeseries/3/
    3 │ 2008-09-11    5298101      6.02   /api/v1/timeseries/3/
    4 │ 2008-09-12    5351503      6.028  /api/v1/timeseries/3/
    5 │ 2008-09-13    5153710      6.035  /api/v1/timeseries/3/
    6 │ 2008-09-14    5170835      6.04   /api/v1/timeseries/3/
    7 │ 2008-09-15    5553350      6.004  /api/v1/timeseries/3/
    8 │ 2008-09-16    5421531      6.011  /api/v1/timeseries/3/
    9 │ 2008-09-17    5380008      6.02   /api/v1/timeseries/3/
   10 │ 2008-09-18    5591645      6.034  /api/v1/timeseries/3/
   11 │ 2008-09-19    5695345      6.063  /api/v1/timeseries/3/
   12 │ 2008-09-20    5291298      6.081  /api/v1/timeseries/3/
   13 │ 2008-09-21    5363113      6.066  /api/v1/timeseries/3/
    ⋮ │     ⋮           ⋮          ⋮               ⋮
 5356 │ 2023-05-15  170487394      6.026  /api/v1/timeseries/3/
 5357 │ 2023-05-16  174192397      6.021  /api/v1/timeseries/3/
 5358 │ 2023-05-17  186034773      6.016  /api/v1/timeseries/3/
 5359 │ 2023-05-18  189092448      6.03   /api/v1/timeseries/3/
 5360 │ 2023-05-19  179957496      6.026  /api/v1/timeseries/3/
 5361 │ 2023-05-20  167540306      6.044  /api/v1/timeseries/3/
 5362 │ 2023-05-21  167091303      6.031  /api/v1/timeseries/3/
 5363 │ 2023-05-22  171660415      6.03   /api/v1/timeseries/3/
 5364 │ 2023-05-23  166443756      6.033  /api/v1/timeseries/3/
 5365 │ 2023-05-24  183687637      6.025  /api/v1/timeseries/3/
 5366 │ 2023-05-25  170265817      6.014  /api/v1/timeseries/3/
 5367 │ 2023-05-26  180664806      6.032  /api/v1/timeseries/3/
                                             5342 rows omitted
Note that the data does seem to be missing a few days. This means we need to compute day offsets in our model instead of using row indices.
last(df.date) - first(df.date)
nrow(df)
5372 days
5367
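Since the inclusive date range spans 5373 calendar days but the frame has only 5367 rows, we expect a handful of missing dates. A quick check (not part of the original analysis) lists them:
using Dates

# Dates in the full daily range with no corresponding row in the data.
missing_days = setdiff(first(df.date):Day(1):last(df.date), df.date)
length(missing_days)  # 6, assuming no duplicate dates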
Now let’s take a look at happiness over time, as computed by Hedonometer.
using Gadfly, Colors

function plot_happiness(df::DataFrame)
    plot(
        df,
        x=:date,
        y=:happiness,
        color=[colorant"darkblue"],
        Guide.xlabel("Date"),
        Guide.ylabel("Happiness"),
        Coord.cartesian(
            xmin=minimum(df.date),
            xmax=maximum(df.date)
        ),
        Theme(
            key_position=:none,
            line_width=0.75px,
            background_color=colorant"white"
        ),
        Geom.line
    )
end
plot_happiness(df)
The data looks right, so we’re off to a good start. Now we have to think about what sort of components we believe are important factors to this index. We’ll start with the same ones as in the Vanderbei model:
- A linear happiness trend describing how our overall happiness changes over time.
- Seasonal trends accounting for mood changes with weather.
- Solar cycle trends.
We’ll add to this weekly trends, as zooming into the data shows we tend to be happier on the weekends than on work days. In the next section we’ll build a model to separate out the effects of these trends on the Hedonometer index.
Model
Vanderbei’s model analyzes daily temperature data for a particular location using least absolute deviations (LAD). This is similar to the well-known least squares approach, but while the latter penalizes the model quadratically more for bigger errors, the former does not. In mathematical notation, the least squares model takes in a known $m \times n$ matrix $A$ and $m \times 1$ vector $y$ of observed data, then searches for a vector $x$ such that $Ax = \hat{y}$ and $\sum_i \left\lVert y_i - \hat{y}_i \right\rVert_2^2$ is minimized.
The LAD model is similar in that it takes in the same data, but instead of minimizing the sum of the squared $L^2$ norms, it minimizes the sum of the $L^1$ norms. Thus we penalize our model using simply the absolute values of its errors instead of their squares. This makes the LAD model more robust, that is, less sensitive to outliers in our input data.
Using a robust model with this data set makes sense because it clearly contains a lot of outliers. While some of them, such as December 25th, may be recurrent, we’re going to ignore that detail for now. After all, not every day is Christmas.
We formulate our model below using JuMP with the HiGHS solver. The code defines a set of decision variables, x, whose optimal values become our model coefficients. For each observation we compute a row of the $A$ matrix that has the following components:
- Linear daily trend ($a_1$ = day number in the data set)
- Seasonal variation: $\cos(2 \pi a_1 / 365.25)$ and $\sin(2 \pi a_1 / 365.25)$
- Solar cycle variation: $\cos(2 \pi a_1 / (10.66 \times 365.25))$ and $\sin(2 \pi a_1 / (10.66 \times 365.25))$
- Weekly variation: $\cos(2 \pi a_1 / 7)$ and $\sin(2 \pi a_1 / 7)$
We then add, for each observation, a pair of nonnegative variables whose difference equals the fitted model's error for that observation. Because we minimize their sum, at an optimum they add up to the absolute value of that error, which is exactly the residual we want.
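Written out explicitly, the linear program we are about to build is:

$$
\begin{aligned}
\min_{x,\, \alpha,\, \beta} \quad & \sum_{i=1}^{m} (\alpha_i + \beta_i) \\
\text{subject to} \quad & \alpha_i - \beta_i = (Ax)_i - y_i, \quad i = 1, \dots, m, \\
& \alpha_i \ge 0, \; \beta_i \ge 0.
\end{aligned}
$$

At an optimal solution at most one of $\alpha_i$ and $\beta_i$ is positive (otherwise we could shrink both and lower the objective), so $\alpha_i + \beta_i = \lvert (Ax)_i - y_i \rvert$.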
Minimizing the sum of those residuals gives us a set of eight coefficients for the model. We return these and a function that predicts the happiness level for an offset from the first data record. (Note that the first record is from Tuesday, September 9, 2008.)
using JuMP, HiGHS, LinearAlgebra

function train(df::DataFrame)
    m = Model(HiGHS.Optimizer)

    # Define a linear variable for each of our regression coefficients.
    # Note that by default, JuMP variables are unrestricted in sign.
    @variable(m, x[1:8])

    # Residuals are the absolute values of the errors comparing our
    # observed and fitted values.
    #
    # If alpha - beta = error and alpha, beta >= 0, then minimizing
    # alpha + beta gives the absolute value of the error.
    @variable(m, alpha[1:nrow(df)] >= 0)
    @variable(m, beta[1:nrow(df)] >= 0)

    # This builds rows for determining fitted values. The first value is
    # 1 since it is multiplied by our trend line's offset. The other
    # values correspond to the trends described above. Sinusoidal elements
    # have two variables with sine and cosine terms.
    function constants(i)
        [
            1,                                # Offset
            i,                                # Daily trend
            cos(2pi * i / 365.25),            # Seasonal variation
            sin(2pi * i / 365.25),            #
            cos(2pi * i / (10.66 * 365.25)),  # Solar cycle variation
            sin(2pi * i / (10.66 * 365.25)),  #
            cos(2pi * i / 7),                 # Weekly variation
            sin(2pi * i / 7)                  #
        ]
    end

    # This builds a linear expression as the dot product of a row's
    # constants and the coefficient variables.
    expression(i) = dot(constants(i), x)

    start = minimum(df.date)
    for (i, row) in enumerate(eachrow(df))
        days = (row.date - start).value
        @constraint(m, alpha[i] - beta[i] == expression(days) - row.happiness)
    end

    # Minimize the total sum of these residuals.
    @objective(m, Min, sum(alpha + beta))
    optimize!(m)

    # Return the model coefficients and a function that predicts happiness
    # for a given day, by offset from the start of the data set.
    coefficients = value.(x)

    # A prediction is the dot product of a day's constants and the
    # fitted coefficients.
    predict(i) = dot(constants(i), coefficients)

    return coefficients, predict
end
coefficients, predictor = train(df)
coefficients
This gives us the optimal value of $x$. The second value is the change in happiness per day. We can see from this that there does seem to be a small negative trend.
8-element Vector{Float64}:
6.056241434337748
-1.2891297798930273e-5
-0.004956377505740697
0.00933036370632761
-0.014231170085464805
-0.01043882249306958
-0.01121031443373725
-0.003886782963711294
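To put that daily trend in perspective, we can scale the second coefficient from a change per day to an approximate change per year:
# Approximate change in happiness per year implied by the daily trend.
coefficients[2] * 365.25  # roughly -0.0047 Hedonometer units per year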
We can call our predictor function to obtain the fitted happiness level for any day number starting from September 9, 2008.
predictor(1000)
6.0206559094198635
Similarly, we can compute a range of fitted happiness values.
predictor.(1000:1009)
10-element Vector{Float64}:
6.0206559094198635
6.0133111627737295
6.01441070655176
6.0230645068990105
6.032696070016563
6.0359946962536455
6.030420605574473
6.020117462633424
6.012792131062921
6.013911265631212
Bootstrapping
We now have a set of coefficients and a predictive model. That’s nice, but we’d like to have some sense of a reasonable range on our model’s coefficients. For instance, how certain are we that our daily trend is really negative? To deal with these uncertainties, we use a method called bootstrapping.
Bootstrapping involves building fake observed data based on our fitted model and its associated errors. We then fit the model to our fake data to determine new coefficients. If we repeat this enough times, we may be able to generate decent confidence intervals around our model coefficients.
First step: compute the errors between the observed and fitted data. We’ll construct a new data frame that contains everything we need to construct fake data.
# Compute fitted data corresponding to our observations and their associated errors.
start = minimum(df.date)
fitted = DataFrame(
    date=df.date,
    happiness=predictor.(map(d -> d.value, df.date .- start))
)
fitted.error = fitted.happiness - df.happiness
4×7 DataFrame
 Row │ variable  mean        min         median      max         nmissing  eltype
     │ Symbol    Union…      Any         Any         Any         Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ date                  2008-09-09  2016-01-19  2023-05-26         0  Date
   2 │ observed  6.01506     5.628       6.016       6.376              0  Float64
   3 │ fitted    6.01858     5.95996     6.0245      6.06771            0  Float64
   4 │ error     0.00351745  -0.327932   0.0         0.353297           0  Float64
Note that the median for our errors is exactly zero. This is a good sign.
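We can also check the median residual directly using the Statistics standard library:
using Statistics

median(fitted.error)  # 0.0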
Now we build a function that creates a fake input data set using the fitted values with randomly selected errors. That is, for each observation, we add a randomly selected error with replacement to its corresponding fitted value. Once we’ve done that for every observation, we have a complete fake data set.
function fake_data(fitted::DataFrame)
    indices = rand(1:nrow(fitted), nrow(fitted))
    DataFrame(
        date=fitted.date,
        happiness=fitted.happiness + fitted.error[indices]
    )
end
Let’s plot some fake data to see if it looks similar.
plot_happiness(fake_data(fitted))
Visually, the plot of an arbitrary fake data set looks a lot like our original data, but not exactly.
Now we generate a number of fake data sets (10 in the code below, to keep the runtime manageable) and run each through our optimization function above. From the resulting sets of model coefficients we compute 95% confidence intervals using a one-sample t-test.
The following code took a few minutes on my machine. If you’re intent on running it yourself, you may want to get some coffee in the meantime.
using HypothesisTests
coefficient_data = [train(fake_data(fitted))[1] for _ in 1:10]
confidence_intervals = map(
    i -> confint(OneSampleTTest([c[i] for c in coefficient_data])),
    1:length(coefficients)
)
confidence_intervals
8-element Vector{Tuple{Float64, Float64}}:
(6.055873073964057, 6.056558575993962)
(-1.304766046882304e-5, -1.2820860816666347e-5)
(-0.005375474015660127, -0.004919187929729383)
(0.009063427594182353, 0.009549696262121937)
(-0.01457854811479927, -0.014061680921618894)
(-0.010625952275843275, -0.010121904880077722)
(-0.011441598179220986, -0.010964483538449823)
(-0.004290285203751687, -0.003833687701331523)
Results
From the above output we can see that we appear to be trending slightly less happy over time, with a daily trend of roughly -0.0000129 Hedonometer units and a 95% confidence interval on that trend of approximately (-0.00001305, -0.00001282). Bummer.
Now we take a quick look at our model output. First, we plot the fitted happiness values for the same time period as the observed data. We can see that this resembles the same general trend minus the outliers. The width of the curve is due to weekly variation.
plot_happiness(fitted)
Now we take a look at what a typical week looks like in terms of its effect on our happiness. As September 9, 2008 (the first record, and day zero for our offsets) was a Tuesday, Sunday corresponds to an offset of 5.
daily(i) = coefficients[7]*cos(2pi*i/7) + coefficients[8]*sin(2pi*i/7)
plot(
    x = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"],
    y = map(daily, [5, 6, 7, 1, 2, 3, 4]),
    Guide.xlabel("Day of the Week"),
    Guide.ylabel("Happiness"),
    Geom.line,
    Geom.point
)
The resulting graph highlights what I think we all already know about the work week.
That’s it for this analysis. We’ve learned that, for the time being at least, we seem to be trending less happy. When we initially did this analysis, almost 11 years ago, the opposite was true. The fitted data shows pretty clearly when that trend took a stark turn down.
Exercises
The particularly ambitious reader may find the following exercises interesting.
- The code that reruns the model using randomly constructed fake data is eligible for parallelization. Rewrite the list comprehension that calls train so it runs concurrently (one possible approach is sketched after this list).
- According to Google, the lunar cycle is approximately 29.53 days. Add parameters for this to the LAD model above. Does it make sense to include the lunar cycle in the model? In other words, are we lunatics?
- Some of the happier days in the Hedonometer data, such as Christmas, are recurring, and therefore not really outliers. How might one go about accounting for the effects of those days?
- Try the same analysis using a least-squares model. Which model is better for this data?
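For the first exercise, here is one possible approach, a minimal sketch that assumes Julia was started with multiple threads (for example, julia --threads=auto). Each call to train builds and solves its own JuMP model, so the iterations are independent and can run in parallel.
using Base.Threads

# Fill a preallocated vector of coefficient vectors from a threaded loop.
function bootstrap_coefficients(fitted::DataFrame, n::Integer)
    results = Vector{Vector{Float64}}(undef, n)
    @threads for i in 1:n
        results[i] = train(fake_data(fitted))[1]
    end
    return results
end

coefficient_data = bootstrap_coefficients(fitted, 10)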
Resources
- are-we-getting-happier.jl contains all the code in this post
- hedonometer.json contains the Hedonometer data as of May 16, 2025