Note: A reader pointed out that Union-Find is a very efficient way to accomplish this task. Start there if you have the same problem! (A sketch appears at the end of this post.)
Last week, Paul Rubin wrote an excellent post on Extracting a Connected Graph from an existing graph. Lately I've been doing related work with data from OpenStreetMap, though without access to a solver. In my case I'm taking in arbitrary network data and splitting it into disconnected sub-networks. I thought it might make a good case study to show an algorithmic way of doing this and some of the performance issues I ran into.
A small example can be seen below. This shows a road network around the Las Vegas strip. There is one main (weakly) connected network in black. The roads highlighted in red are disconnected from the main network. We want code that will split these into connected sub-networks.
Say we have data that looks like the following. The numbers represent edges rather than nodes; think of them as streets. Each key is an edge ID (JSON keys are always strings, hence the quotes), and its list holds the IDs of the edges it connects to.
{
"0": [1, 2, 3],
"1": [9248, 9249, 9250],
"2": [589, 9665, 9667],
"3": [0, 5, 6],
"4": [0, 5, 6],
"5": [588],
"6": [4, 8, 9],
...
}
Our basic strategy is the following:
- Start with every edge alone in its own subnetwork.
- For each connection, merge the networks of the source and destination edges.
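To see the strategy in miniature before the full script, here is a toy version on a five-edge graph. This is a sketch for illustration only; plain sets suffice here because we key them by edge ID rather than storing them inside other sets, and the toy data assumes every edge appears as a key.
# Toy version of the merge strategy on five edges.
connections = {0: [1, 2], 1: [0], 2: [0], 3: [4], 4: [3]}
nets = {e: {e} for e in connections}  # every edge starts alone in its own subnetwork
for src, targets in connections.items():
    for dst in targets:
        if nets[src] is not nets[dst]:  # different subnetworks: merge them
            old = nets[dst]
            nets[src].update(old)
            for e in old:  # re-point every member of the absorbed set
                nets[e] = nets[src]
distinct = {id(s): s for s in nets.values()}
print(list(distinct.values()))  # -> [{0, 1, 2}, {3, 4}]
The real script below does the same thing, plus the bookkeeping needed to read the JSON input and track the set of known networks.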
#!/usr/bin/env python3
import json
import sys
import time
class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))
    def __eq__(self, other):
        # Compare by identity as well; Python 3 has no __cmp__.
        return self is other
if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except IndexError:
        print('usage: %s network.json' % sys.argv[0])
        sys.exit(1)

    print(time.asctime(), 'parsing json input')
    with open(inputfile) as f:
        connections = json.load(f)
edge_to_net = {} # Edge ID -> set([edges that are in the same network])
nets = set() # Set of known networks
print(time.asctime(), 'detecting disconnected subgraphs')
    for i, (from_edge, to_set) in enumerate(connections.items()):
from_edge = int(from_edge)
try:
from_net = edge_to_net[from_edge]
except KeyError:
from_net = edge_to_net[from_edge] = hset([from_edge])
nets.add(from_net)
        if not (i+1) % (25 * 1000):
            print(time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets)))
for to in to_set:
try:
to_net = edge_to_net[to]
                # If we get here, both edges have nets; merge from_net into to_net.
if to_net is not from_net:
to_net.update(from_net)
for e in from_net:
edge_to_net[e] = to_net
nets.remove(from_net)
from_net = to_net
except KeyError:
from_net.add(to)
edge_to_net[to] = from_net
print(time.asctime(), len(nets), 'subnets found')
We run this against the network pictured above and it works reasonably quickly, finishing in about 7 seconds:
Mon Jul 29 12:22:38 2013 parsing json input
Mon Jul 29 12:22:38 2013 detecting disconnected subgraphs
Mon Jul 29 12:22:38 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:22:44 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:22:45 2013 60 subnets found
However, when run against a road network for an entire city, the process continues for several hours. What is the issue?
The inefficiency is in the merge step. Every time two networks meet, we re-point a reference for each element of `from_net`, even when `from_net` is a very large set. It would be better to move as few references as possible. So instead of always merging `from_net` into `to_net`, we will determine which network is the smaller of the two and merge that one into the larger one. An edge then only moves when its network is the smaller of the pair, so the set it belongs to at least doubles in size every time it moves. Note that this does not necessarily change the worst-case time complexity of the algorithm, but it should make the code fast enough to be useful. The new version appears below.
#!/usr/bin/env python3
import json
import sys
import time
class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))
    def __eq__(self, other):
        # Compare by identity as well; Python 3 has no __cmp__.
        return self is other
if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except IndexError:
        print('usage: %s network.json' % sys.argv[0])
        sys.exit(1)

    print(time.asctime(), 'parsing json input')
    with open(inputfile) as f:
        connections = json.load(f)
edge_to_net = {} # Edge ID -> set([edges that are in the same network])
nets = set() # Set of known networks
print(time.asctime(), 'detecting disconnected subgraphs')
    for i, (from_edge, to_set) in enumerate(connections.items()):
from_edge = int(from_edge)
try:
from_net = edge_to_net[from_edge]
except KeyError:
from_net = edge_to_net[from_edge] = hset([from_edge])
nets.add(from_net)
if not (i+1) % (25 * 1000):
print(time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets)))
for to in to_set:
try:
to_net = edge_to_net[to]
                # If we get here, both edges already have nets; merge if they differ.
if to_net is not from_net:
# Update references to and remove the smaller set for speed.
if len(to_net) < len(from_net):
smaller, larger = to_net, from_net
else:
smaller, larger = from_net, to_net
larger.update(smaller)
for e in smaller:
edge_to_net[e] = larger
nets.remove(smaller)
edge_to_net[to] = larger
from_net = larger
except KeyError:
from_net.add(to)
edge_to_net[to] = from_net
print(time.asctime(), len(nets), 'subnets found')
Indeed, this is significantly faster. And on very large networks it runs in minutes instead of hours or days. On the small test case used for this post, it runs in under a second. While this could probably be done faster, that’s actually good enough for right now.
Mon Jul 29 12:39:55 2013 parsing json input
Mon Jul 29 12:39:55 2013 detecting disconnected subgraphs
Mon Jul 29 12:39:55 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:39:55 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:39:55 2013 60 subnets found
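Finally, coming back to the note at the top: the merge-the-smaller-set trick is exactly what the union-find (disjoint-set) structure formalizes, with the further refinement of path compression. Here is a minimal sketch, assuming the same JSON input format as above; the helper names are my own.
#!/usr/bin/env python3
# Union-find (disjoint-set) sketch for the same task.
import json
import sys

parent = {}  # edge -> parent edge; a root is its own parent
size = {}    # root edge -> number of edges in its subnetwork

def add(x):
    '''Register an edge as its own single-element subnetwork.'''
    if x not in parent:
        parent[x] = x
        size[x] = 1

def find(x):
    '''Return the root of x's subnetwork, compressing the path on the way up.'''
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    '''Merge the subnetworks containing a and b, smaller under larger.'''
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if size[ra] < size[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        connections = json.load(f)
    for from_edge, to_set in connections.items():
        from_edge = int(from_edge)
        add(from_edge)
        for to in to_set:
            add(to)
            union(from_edge, to)
    print(len({find(x) for x in parent}), 'subnets found')
With both union by size and path compression, each operation runs in effectively constant amortized time, which is why the reader's suggestion is the right starting point for this problem.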