The charming adventures of an analyst and his solver.

« Tuesday, November 30, 2010 »

Off-the-Cuff Voter Fraud Detection

Using the exponential distribution to interpret votes in a web survey.

Consider this scenario: You run a contest that accepts votes from the general Internet population. In order to encourage user engagement, you record any and all votes into a database over several days, storing nothing more than the competitor voted for, when each vote is cast, and a cookie set on the voter's computer along with their apparent IP addresses. If a voter already has a recorded cookie set they are denied subsequent votes. This way you can avoid requiring site registration, a huge turnoff for your users. Simple enough.

Unfortunately, some of the competitors are wily and attached to the idea of winning. They go so far as programming or hiring bots to cast thousands of votes for them. Your manager wants to know which votes are real and which ones are fake Right Now. Given very limited time, and ignoring actions that you could have taken to avoid the problem, how can you tell apart sets of good votes from those that shouldn't be counted?

One quick-and-dirty option involves comparing histograms of interarrival times for sets of votes. Say you're concerned that all the votes during a particular period of time or from a given IP address might be fraudulent. Put all the vote times you're concerned about into a list, sort them, and compute their differences:

# times is a list of datetime instances from vote records
times.sort(reversed=True)
interarrivals = [y-x for x, y in zip(times, times[1:]]


Now use matplotlib to display a histogram of these. Votes that occur naturally are likely to resemble an exponential distribution in their interarrival times. For instance, here are interarrival times for all votes received in a contest:

Interarrival times for all submissions

This subset of votes is clearly fraudulent, due to the near determinism of their interarrival times. This is most likely caused by the voting bot not taking random sleep intervals during voting. It casts a vote, receives a response, clears its cookies, and repeats:

Interarrival times for clearly fraudulent votes

These votes, on the other hand, are most likely legitimate. They exhibit a nice Erlang shape and appear to have natural interarrival times that one would expect:

Proper-looking interarrival times

Of course this method is woefully inadequate for rigorous detection of voting fraud. Ideally one would find a method to compute the probability that a set of votes is generated by a bot. This is enough to inform quick, ad hoc decisions though.