adventures in optimization

The charming adventures of an analyst and his solver.

« Wednesday, May 28, 2014 »

Preprocessing for Routing Problems - Part 1

Preprocessing techniques for splitting space into regions which are closest to a set of points.

Consider an instance of the vehicle routing problem in which we have drivers that are geographically distributed, each in a unique location. Our goal is to deliver goods or services to a set of destinations at the lowest cost. It does not matter to our customers which driver goes to which destination, so long as the deliveries are made.

One can think of this problem as a collection of travelling salesman problems, where there are multiple salespeople in different locations and a shared set of destinations. We attempt to find the minimum cost schedule for all salespeople that visits all destinations, where each salesman can technically go anywhere.

We believe that sending a driver farther will result in increased cost. But, given a particularly good tour, we might do that anyway. On the other hand, there are plenty of assignments we would never consider. It would be madness to send a driver from Los Angeles to New York if we already have another person stationed near there. Thus there are a large number of scenarios that may be possible, but that we will never pursue.

Our ultimate goal is to construct a model that finds an optimal (or near-optimal) schedule. Before we get to that, we have a bit of preprocessing to do. We would like to create regions for our drivers that make some bit of sense, balancing constraints on travel time with redundant coverage of our customers. Once we have these regions, we will know where we can allow our drivers to go in the final schedule.

Let's get started in R. We'll assume that we have drivers stationed at our regional headquarters in the 25 largest US cities by population. We assume that every possible customer address will be in some five digit zip code in the continental US. We pull this information out of the standard R data sets and pare down to only unique locations, fixing a couple errors in the data along the way.

library(datasets)
library(zipcode)
data(zipcode)

# Alexandria, VA is not in Normandy, France.
zipcode[zipcode$zip=='22350', c('latitude', 'longitude')] <- c(38.863930, -77.055547)

# New York City, NY is not in Kyrgyzstan.
zipcode$longitude[zipcode$zip=='10200'] <- -zipcode$longitude[zipcode$zip=='10200']

# Pare down to zip codes in the continental US.
states_continental <- state.abb[!(state.abb %in% c('AK', 'HI'))]
zips_continental <- subset(zipcode, state %in% states_continental)
zips_deduped <- zips_continental[!duplicated(zips_continental[, c('latitude', 'longitude')]), ]

# Geographic information for top 25 cities in the country.
library(maps)
data(us.cities)
largest_cities <- subset(
    us.cities[order(us.cities$pop, decreasing=T),][1:25,],
    select = c('name', 'lat', 'long')
)

With this information we can get some sense of what we're up against. We generate a map off all the zip code locations in blue and our driver locations in red.

# Plot our corporate headquarters and every unique zip code location.
map('state')
points(zips_deduped$longitude, zips_deduped$latitude, pch=21, col=rgb(0, 0, 1, .5), cex=0.1)
points(largest_cities$long, largest_cities$lat, pch=21, bg=rgb(1, 0, 0, .75), col='black', cex=1.5)


Zip code and driver locations

So how do we go about assigning zip codes to our drivers? One option is to draw circles of a given radius around our drivers and increase that radius until we have the coverage we need.


Radius-based regions

On second thought, that doesn't work so well. By the time we have large enough radius, there will be so much overlap the assignments won't make much sense. It would be better if we started by assigning each zip code to the driver that is physically closest. We could then start introducing redundancy into our data by adding the second closest driver, and so on.

# Euclidean distance from each HQ to each zip code.
library(SpatialTools)
zips_to_hqs_dist <- dist2(
    matrix(c(zips_deduped$longitude, zips_deduped$latitude), ncol=2),
    matrix(c(largest_cities$long, largest_cities$lat), ncol=2)
)

# Rank HQs by their distance to each unique zip code location.
hqs_to_zips_rank <- matrix(nrow=nrow(largest_cities), ncol=nrow(zips_deduped))
for (i in 1:nrow(zips_deduped)) {
    hqs_to_zips_rank[,i] <- rank(zips_to_hqs_dist[i,], ties.method='first')
}

Let's see what this looks like on the map. The following shows what the region for the Dallas, TX driver would be if she were only allowed to visit zip codes for which she is the closest, second closest, and third closest. We map these as polygons using the convex hull of their respective zip code locations.

# Now we draw regions for which Dallas is one of the closest 3 HQs.
hq_idx <- which(largest_cities$name == 'Dallas TX')
redundancy_levels <- c(3, 2, 1)
fill_alpha <- c(0.15, 0.30, 0.45)

map('state')
for (i in 1:length(redundancy_levels)) {
    # Find every zip for which this HQ is within n in distance rank.
    within_n <- hqs_to_zips_rank[hq_idx,] <= redundancy_levels[i]

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_n],
        zips_deduped$latitude[within_n]
    )
    hull_x <- zips_deduped$longitude[within_n][hull_order]
    hull_y <- zips_deduped$latitude[within_n][hull_order]
    polygon(hull_x, hull_y, border='blue', col=rgb(0, 0, 1, fill_alpha[i]))
}

# The other HQs.
other_hqs = 1:nrow(largest_cities) != hq_idx
points(
    largest_cities$long[other_hqs],
    largest_cities$lat[other_hqs],
    pch = 21,
    bg = rgb(0.4, 0.4, 0.4, 0.6),
    col = 'black',
    cex = 1.5
)

# This HQ.
points(
    largest_cities$long[hq_idx],
    largest_cities$lat[hq_idx],
    pch = 21,
    bg = rgb(1, 0, 0, .85),
    col = 'black',
    cex = 1.5
)


Euclidean distance-based regions for Dallas, TX

This makes a bit more sense. If we enforce a redundancy level of 1, then every zip code has exactly one person assigned to it. As we increase that redundancy level, we have more options in terms of driver assignment. And our optimization model will grow correspondingly in size.

The following produces a map of all our regions where each zip code is served only by its closest driver.

# Map of regions where every zip is served only by its closest HQ.
map('usa')
for (hq_idx in 1:nrow(largest_cities)) {
    # Find every zip for which this HQ is the closest.
    within_1 <- hqs_to_zips_rank[hq_idx,] == 1
    within_1[is.na(within_1)] <- F

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_1],
        zips_deduped$latitude[within_1]
    )
    hull_x <- zips_deduped$longitude[within_1][hull_order]
    hull_y <- zips_deduped$latitude[within_1][hull_order]
    polygon(
        hull_x,
        hull_y,
        border = 'black',
        col = rgb(0, 0, 1, 0.25)
    )
}

# All HQs
points(
    largest_cities$long,
    largest_cities$lat,
    pch = 21,
    bg = rgb(1, 0, 0, .75),
    col = 'black',
    cex = 1.5
)


All regions with redundancy level of 1

This is a good start. Our preprocessing step gives us a reasonable level of control over the assignments of drivers before we begin optimizing. So what's missing?

One immediately apparent failure is that these regions are based on Euclidean distance. Travel time is not a simple function of that. It would be much better if we could create regions using estimated time, drawing them based on topology of the highway system. We'll explore techniques for doing so in the next post.