adventures in optimization

The charming adventures of an analyst and his solver.

« Monday, July 29, 2013 »

Network Splitting

Splitting large networks into fully connected subnetworks. Also, Las Vegas!

Note: A reader pointed out that Union-Find is a very efficient way to accomplish this task. So if you need to do something similar, start there!

Last week, Paul Rubin wrote an excellent post on Extracting a Connected Graph from an existing graph. Lately I've been performing related functions on data from OpenStreetMap, though without access to a solver. In my case I'm taking in arbitrary network data and splitting it into disconnected subnetworks. I thought it might be a good case study to show an algorithmic way doing this and some of the performance issues I ran into.

A small example can be seen below. This shows a road network around the Las Vegas strip. There is one main (weakly) connected network in black. The roads highlighted in red are disconnected from the main network. We want code that will split these into connected subnetworks.

Say we have data that looks like the following. Instead of nodes, the numbers in quotes represent edges. Think of these as streets.

{
    "0": [1, 2, 3],
    "1": [9248, 9249, 9250],
    "2": [589, 9665, 9667],
    "3": [0, 5, 6],
    "4": [0, 5, 6],
    "5": [588],
    "6": [4, 8, 9],
    ...
}

Our basic strategy is the following:

  1. Start with every edge alone in its own subnetwork.
  2. For each connection, merge the networks of the source and destination edges.

#!/usr/bin/env python
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))

    def __cmp__(self, other):
        return cmp(id(self), id(other))

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except:
        print 'usage: %s network.json' % sys.argv[0]
        sys.exit()

    print time.asctime(), 'parsing json input'
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print time.asctime(), 'detecting disconnected subgraphs'
    for i, (from_edge, to_set) in enumerate(connections.iteritems()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets))

        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # If we get here, merge the to_net into the from_net.
                if to_net is not from_net:
                    to_net.update(from_net)
                    for e in from_net:
                        edge_to_net[e] = to_net
                    nets.remove(from_net)
                    from_net = to_net

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print time.asctime(), len(nets), 'subnets found'

We run this against the network pictured above and it works reasonably quickly, finishing in about 7 seconds:

Mon Jul 29 12:22:38 2013 parsing json input
Mon Jul 29 12:22:38 2013 detecting disconnected subgraphs
Mon Jul 29 12:22:38 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:22:44 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:22:45 2013 60 subnets found

However, when run against a road network for an entire city, the process continues for several hours. What is the issue?

The inefficiency occurs from lines 46 to 50. In this we are frequently removing references to every element in a large set. Instead, it would be better to remove as few references as possible. Therefore, instead of merging from_net into to_net, we will determine which network is the smaller of the two and marge that one into the larger one. Note that this does not necessarily change the worst case time complexity of the algorithm, but it should make the code fast enough to be useful. The new version appears below.

#!/usr/bin/env python
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))

    def __cmp__(self, other):
        return cmp(id(self), id(other))

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except:
        print 'usage: %s network.json' % sys.argv[0]
        sys.exit()

    print time.asctime(), 'parsing json input'
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print time.asctime(), 'detecting disconnected subgraphs'
    for i, (from_edge, to_set) in enumerate(connections.iteritems()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets))

        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # If we get here, merge the to_net into the from_net.
                if to_net is not from_net:
                    # Update references to and remove the smaller set for speed.
                    if len(to_net) < len(from_net):
                        smaller, larger = to_net, from_net
                    else:
                        smaller, larger = from_net, to_net

                    larger.update(smaller)
                    for e in smaller:
                        edge_to_net[e] = larger
                    nets.remove(smaller)
                    edge_to_net[to] = larger
                    from_net = larger

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print time.asctime(), len(nets), 'subnets found'

Indeed, this is significantly faster. And on very large networks it runs in minutes instead of hours or days. On the small test case used for this post, it runs in under a second. While this could probably be done faster, that's actually good enough for right now.

Mon Jul 29 12:39:55 2013 parsing json input
Mon Jul 29 12:39:55 2013 detecting disconnected subgraphs
Mon Jul 29 12:39:55 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:39:55 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:39:55 2013 60 subnets found