This is based on a lightning talk I gave at the LA PyLadies October Hackathon.
I’m actually not going to go into anything much resembling algorithmic complexity here. What I’d like to do is present a common performance anti-pattern that I see from novice programmers about once every year or so. If I can prevent one person from committing this error, this post will have achieved its goal. I’d also like to show how an intuitive understanding of time required by operations in relation to the size of data they operate on can be helpful.
Say you have a Big List of Things. It doesn’t particularly matter what these things are. Often they might be objects or dictionaries of denormalized data. In this example we’ll use numbers. Let’s generate a list of 1 million integers, each randomly chosen from the first 100 thousand natural numbers:
import random
choices = range(100000)
x = [random.choice(choices) for i in range(1000000)]
Now say you want to remove (or aggregate, or structure) duplicate data while keeping them in order of appearance. Intuitively, this seems simple enough. A first solution might involve creating a new empty list, iterating over x, and only appending those items that are not already in the new list.
The Bad Way
order = []
for i in x:
if i not in order:
order.append(i)
Try running this. What’s wrong with it?
The issue is the conditional on line 3. In the worst case, it could look at every item in the order list for each item in x. If the list is big, as it is in our example, that wastes a lot of cycles. We can reason that we can improve the performance of our code by replacing this conditional with something faster.
The Good Way
Given that sets have near constant time for membership tests, one solution is to create a companion data structure, which we’ll call seen. Being a set, it doesn’t care about the order of the items, but it will allow us to test for membership quickly.
order = []
seen = set()
for i in x:
if i not in seen:
seen.add(i)
order.append(i)
Now try running this. Better?
Not that this is the best way to perform this particular action. If you aren’t familiar with it, take a look at the groupby
function from itertools
, which is what I will sometimes reach for in a case like this.