adventures in optimization

The charming adventures of an analyst and his solver.

« Friday, June 27, 2014 »

Preprocessing for Routing Problems - Part 2

Preprocessing techniques for dividing space based on closest points by driving distance.

In the previous post, we considered preprocessing for the vehicle routing problem where the vehicles have different starting locations. Our goal was to create potentially overlapping regions for the entire US which we could later use for route construction. We defined these regions using all 5-digit zip codes in the continental US for which one of our regional headquarters is the closest, or one of $n$ closest, headquarters in terms of Euclidean distance. The resulting regions gave us some flexibility in terms of how much redundancy we allow in our coverage of the country.

This post refines those regions, replacing the Euclidean distance with a more realistic metric: estimated travel time. Doing this should give us a better sense of how much space a given driver can actually cover. It should also divide the country up more equitably among our drivers.

Our approach here will be similar to that of the last post, but instead of ranking our headquarter-zip pairs by Euclidean distance, we'll rank them by estimated travel time. The catch is that, while the former is easy to compute using the SpatialTools library, we have to request the latter from a third party. In this post, we'll use the MapQuest Route Matrix, since it provides estimates based on OpenStreetMap data to us for free, and doesn't cap the number of requests we can make.

To do this we're going to need a lot of point estimates for location-to-location travel times. In fact, building a full data set for replacing our Euclidean distance ranks would require $37,341 \times 25 = 933,525$ travel time estimates. That's a bit prohibitive. The good news is we don't need to all the data points unless we generate 25 levels of redundancy. We can just request enough travel time estimates to make us reasonably certain that we've got all the necessary data. In the last post we generated regions for 1, 2, and 3 levels of redundancy, so here we'll get travel times for the 10 closest headquarters to each zip code, and take the leap of faith that the closest 3 headquarters by travel time for each zip will be among the 10 closest by Euclidean distance.

Note: I've provided the output data so you don't have to make hundreds of thousands of requests to MapQuest yourself. Instead of running the following code, please download it here. You can then run the section at the end that generates the maps.

Let's assume that you have just executed the code from the last post and have its variables in your current scope. If you need it, you can find that code here. First, we define some constants we're going to need in order to make MapQuest requests.

# Define some constants for making requests to MapQuest and determining
# when to save and what to request.
library(RCurl)
library(rjson)
library(utils)

MAPQUEST_API_KEY <- 'YOUR KEY HERE'
MAPQUEST_API_URL <- 'http://www.mapquestapi.com/directions/v2/routematrix?key=%s&json=%s'
ZIPS_BETWEEN_SAVE <- 250
HQ_RANK_MIN <- 1  # Min/max distance ranks for time estimates
HQ_RANK_MAX <- 10

Now we create a data frame to hold our HQ-to-zip travel estimates. The rows correspond to zip codes and the columns correspond to our headquarter locations. We initialize the data frame to contain no estimates and write it to a CSV file. Since it will take on the order of days for us to fill this file in, we're going to write it out and read it back in periodically. That way we can pick up where we left off by simply rerunning the code in case of an error or loss of network connectivity.

# Write out a blank file containing our time estimates.
TIME_CSV_PATH <- 'hqs_to_zips_time.csv'
if (!file.exists(TIME_CSV_PATH)) {
    # Clear out everything except row and column names.
    empty <- as.data.frame(matrix(nrow=nrow(zips_deduped), ncol=nrow(largest_cities)+1))
    names(empty) <- c('zip', largest_cities$name)
    empty$zip <- zips_deduped$zip

    # This represents our current state.
    write.csv(empty, TIME_CSV_PATH, row.names=F)
}

# Read in our current state in case we are starting over.
hqs_to_zips_time <- read.csv(TIME_CSV_PATH)
hqs_to_zips_time$zip <- sprintf('%05d', hqs_to_zips_time$zip)

# Sanity check: If we have any times = 0, set them to NA so that we re-request them.
hqs_to_zips_time[hqs_to_zips_time <= 0] <- NA

With that file created, we can start making requests to MapQuest's Route Matrix. For each zip code, we are going to request travel times for its 10 closest HQs. We'll save our time estimates data frame every 250 zip codes. Also, we're going to randomize the order of the zip codes so we fill out our data set more evenly as we go. That way we can generate maps during the process or otherwise inspect the data as we go.

# Now we start requesting travel times from MapQuest.
requests_until_save <- ZIPS_BETWEEN_SAVE
col_count <- ncol(hqs_to_zips_time)

# Randomize the zip code order so we fill in the map uniformly as we get more data.
# This will enable us to check on our data over time and make sure it looks right.
for (zip_idx in sample(1:nrow(zips_deduped))) {
    z <- zips_deduped$zip[zip_idx]
    z_lat <- zips_deduped$latitude[zip_idx]
    z_lon <- zips_deduped$longitude[zip_idx]

    # Find PODs for this zip that are in the rank range.
    which_hqs <- which(
        hqs_to_zips_rank[,zip_idx] >= HQ_RANK_MIN &
        hqs_to_zips_rank[,zip_idx] <= HQ_RANK_MAX
    )

    # We're only interested in records that aren't filled in yet.
    na_pods <- is.na(hqs_to_zips_time[zip_idx, which_hqs+1])
    if (length(hqs_to_zips_time[zip_idx,2:col_count][na_pods]) < 1) {
        next
    }

    # Request this block of PODs and fill them all in.
    print(sprintf('requesting: zip=%s rank=[%d-%d]', z, HQ_RANK_MIN, HQ_RANK_MAX))

    # Construct a comma-delimited string of lat/lons containing the locations of our
    # HQs We will use this for our MapQuest requests below: for each zip code, we
    # make one request for its distance to every HQ in our range.
    hq_locations <- paste(
        sprintf("'%f,%f'", largest_cities$lat[which_hqs], largest_cities$long[which_hqs]),
        collapse = ', '
    )

    # TODO: make sure we are requesting from location 1 to 2:n only
    request_json <- URLencode(sprintf(
        "{allToAll: false, locations: ['%f,%f', %s]}",
        z_lat,
        z_lon,
        hq_locations
    ))
    url <- sprintf(MAPQUEST_API_URL, MAPQUEST_API_KEY, request_json)
    result <- fromJSON(getURL(url))

    # If we get back 0s, they should be NA. Otherwise they'll mess up our
    # rankings and region drawing later.
    result$time[result$time <= 0] <- NA

    hqs_to_zips_time[zip_idx, which_hqs+1] <- result$time[2:length(result$distance)]

    # See if we should save our current state.
    requests_until_save <- requests_until_save - 1
    if (requests_until_save < 1) {
        print('saving current state')
        write.csv(hqs_to_zips_time, TIME_CSV_PATH, row.names=F)
        requests_until_save <- ZIPS_BETWEEN_SAVE
    }
}

# Final save once we are done.
write.csv(hqs_to_zips_time, TIME_CSV_PATH, row.names=F)

Assuming you didn't actually run that code and instead opted to download the data, run the following code to read it in.

hqs_to_zips_time <- read.csv(TIME_CSV_PATH)
hqs_to_zips_time$zip <- sprintf('%05d', hqs_to_zips_time$zip)

Now we generate our ranks based on travel time instead of distance. We have to be a bit more careful this time, as we might have incomplete data. We don't want pairs with travel time of NA showing up in the rankings.

# Rank HQs by their distance to each unique zip code location.
hqs_to_zips_rank2 <- matrix(nrow=nrow(largest_cities), ncol=nrow(zips_deduped))
for (i in 1:nrow(zips_deduped)) {
    not_na <- !is.na(hqs_to_zips_time[i,2:ncol(hqs_to_zips_time)])
    hqs_to_zips_rank2[not_na,i] <-
        rank(hqs_to_zips_time[i,2:ncol(hqs_to_zips_time)][not_na], ties.method='first')
}

We build our map for the Dallas, TX headquarter the same way as before.

# Now we draw regions for which Dallas is one of the closest 3 HQs by time.
hq_idx <- which(largest_cities$name == 'Dallas TX')
redundancy_levels <- c(3, 2, 1)
fill_alpha <- c(0.15, 0.30, 0.45)

map('state')
for (i in 1:length(redundancy_levels)) {
    # Find every zip for which this HQ is within n in time rank.
    within_n <- hqs_to_zips_rank2[hq_idx,] <= redundancy_levels[i]

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_n],
        zips_deduped$latitude[within_n]
    )
    hull_x <- zips_deduped$longitude[within_n][hull_order]
    hull_y <- zips_deduped$latitude[within_n][hull_order]
    polygon(hull_x, hull_y, border='blue', col=rgb(0, 0, 1, fill_alpha[i]))
}

# The other HQs.
other_hqs = 1:nrow(largest_cities) != hq_idx
points(
    largest_cities$long[other_hqs],
    largest_cities$lat[other_hqs],
    pch = 21,
    bg = rgb(0.4, 0.4, 0.4, 0.6),
    col = 'black',
    cex = 1.5
)

# This HQ.
points(
    largest_cities$long[hq_idx],
    largest_cities$lat[hq_idx],
    pch = 21,
    bg = rgb(1, 0, 0, .85),
    col = 'black',
    cex = 1.5
)

This shows the regions for which Dallas is among the closest headquarters for 1, 2, and 3 level of redundancy. Compare this map to the one from the previous post, and you'll see that it conforms better to the highway system. For instance, it takes into account I-20 which moves east and west across Texas, instead of pushing up into the Dakotas.


Travel time-based regions for Dallas, TX

And now our map of the US, showing the regions for each HQ as the set of zip codes for which it is the closest.

# Map of regions where every zip is served only by its closest HQ.
map('usa')
for (hq_idx in 1:nrow(largest_cities)) {
    # Find every zip for which this HQ is the closest.
    within_1 <- hqs_to_zips_rank2[hq_idx,] == 1
    within_1[is.na(within_1)] <- F

    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_1],
        zips_deduped$latitude[within_1]
    )
    hull_x <- zips_deduped$longitude[within_1][hull_order]
    hull_y <- zips_deduped$latitude[within_1][hull_order]
    polygon(
        hull_x,
        hull_y,
        border = 'black',
        col = rgb(0, 0, 1, 0.25)
    )
}

# All HQs
points(
    largest_cities$long,
    largest_cities$lat,
    pch = 21,
    bg = rgb(1, 0, 0, .75),
    col = 'black',
    cex = 1.5
)

This gives us our new map. If we compare this with the original, it should better reflect the topology of the highway system. It also looks a bit less jagged.


All regions with redundancy level of 1

Exercises for the reader:

  • Some of these regions overlap, even though they are supposed to be only composed of zip codes for which a given HQ is the closest. Why is that?
  • Say we want to limit our driver to given maximum travel times. Based on our data from MapQuest, draw concentric circles representing approximate 3, 5, and 7 hour travel time regions.