Updated September 26, 2022 bugfix

The goal of this Python notebook is to download and prepare a suite of benchmark networks for some shortest path algorithms. We would like to experiment with some simple directed graphs with non-negative weights. We are specially interested in road networks.

The files are available on the Universita Di Roma website. It was created for the 9th DIMACS implementation challenge : Implementation Challenge about Shortest Paths. This challenge dates back to 2006, but the files are still there. Here is the download web page : http://www.diag.uniroma1.it//challenge9/download.shtml

The networks correspond to different parts of the USA road network, with various region sizes. Here are the different network names, from smaller to larger:

• NY : New York
• BAY : Bay area
• FLA : Florida
• NW : North West
• NE : North East
• CAL : Califorinia
• LKS : Great Lakes
• E : Eastern region
• W : Western region
• CTR : Central region
• USA : contiguous United States

Known issues: the data has numerous errors, in particular gaps in major highways and bridges. This may result in routes that are very different from real-life ones. One should take this into consideration when experimenting with the data.

This does not really matter to us because the plan is to implement a shortest path algorithm in Python/Cython and to compare various implementations, but not to find realistic routes.

The networks are available as compressed text files. So here are the very simple steps followed in this notebook:

1. create a folder structure
3. uncompress them with gzip
4. Create a function to load the edges into a Pandas dataframe
5. Create a function to load the node coordinates into a Pandas dataframe
6. Save the networks into parquet files
7. Query the parquet files with DuckDB
8. Plot the networks with Datashader

## Imports

import gzip
import os
import shutil

import duckdb
import geopandas as gpd
import numpy as np
import pandas as pd
import wget

DATA_DIR_ROOT_PATH = "/home/francois/Data/Disk_1/"  # root data dir

Package versions:

Python    : 3.10.6
duckdb    : 0.5.0
geopandas : 0.11.1
numpy     : 1.22.4
pandas    : 1.4.4
pyarrow   : 9.0.0
pygeos    : 0.13
wget      : 3.2


## Create a folder structure

We start by creating a main directory DIMACS_road_networks, and then one directory per network.

data_dir_path = os.path.join(DATA_DIR_ROOT_PATH, "DIMACS_road_networks")
if not os.path.exists(data_dir_path):
os.makedirs(data_dir_path)
names = [
"NY",
"BAY",
"COL",
"FLA",
"NW",
"NE",
"CAL",
"LKS",
"E",
"W",
"CTR",
"USA",
]
network_dir_paths = {}
for name in names:
network_dir_path = os.path.join(data_dir_path, name)
if not os.path.exists(network_dir_path):
os.makedirs(network_dir_path)
network_dir_paths[name] = network_dir_path
!tree -d {data_dir_path}
/home/francois/Data/Disk_1/DIMACS_road_networks
├── BAY
├── CAL
├── COL
├── CTR
├── E
├── FLA
├── LKS
├── NE
├── NW
├── NY
├── USA
└── W

12 directories


Three types of files are available from the DIMACS challenge web site:

• Distance graph: edges with distance weight
• Travel time graph: edges with travel time weight
• Coordinates: node coordinates [latitude, longitude]

However, we are only going to download the coordinates and travel time graph files in the present notebook. We only require one type of edge weight in order run shortest path algorithms. So between weight=distance or weight=travel_time, we chose the latter.

The file URL has a neat pattern.

• travel time graph : http://www.diag.uniroma1.it//challenge9/data/USA-road-t/USA-road-t.XXX.gr.gz
• coordinates : http://www.diag.uniroma1.it//challenge9/data/USA-road-d/USA-road-d.XXX.co.gz

where XXX is the network name. So we download each of these files with wget and save them in the respective network folder.

travel_time_graph_file_paths = {}
coordinates_file_paths = {}
for name in names:

# travel time graph
travel_time_graph_file_path = os.path.join(
)
travel_time_graph_file_paths[name] = travel_time_graph_file_path

# coordinates
coordinates_file_path = os.path.join(
)
coordinates_file_paths[name] = coordinates_file_path

if (not os.path.exists(travel_time_graph_file_path)) & (
not os.path.exists(travel_time_graph_file_path.removesuffix(".gz"))
):
if (not os.path.exists(coordinates_file_path)) & (
not os.path.exists(coordinates_file_path.removesuffix(".gz"))
):
wget.download(coordinates_url, coordinates_file_path)

## Uncompress the network files with gzip

We first create a small function that is looking for a zipped file. If found, the zipped file is uncompressed and removed.

def extract_and_cleanup(file_path_gz):
file_path = file_path_gz.split(".gz")[0]
try:
with gzip.open(file_path_gz, "rb") as f_in:
with open(file_path, "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
os.remove(file_path_gz)
except FileNotFoundError:
pass
return file_path

Now we call the above function extract_and_cleanup for each node and edge file:

for name in names:

# travel time graph
file_path_gz = travel_time_graph_file_paths[name]
file_path = extract_and_cleanup(file_path_gz)
travel_time_graph_file_paths[name] = file_path

# coordinates
file_path_gz = coordinates_file_paths[name]
file_path = extract_and_cleanup(file_path_gz)
coordinates_file_paths[name] = file_path

## Create a function to load the edges into a Pandas dataframe

Let's have a look at one of the edge file:

! head -n 10 {travel_time_graph_file_paths["NY"]}
c 9th DIMACS Implementation Challenge: Shortest Paths
c http://www.dis.uniroma1.it/~challenge9
c
p sp 264346 733846
c graph contains 264346 nodes and 733846 arcs
c
a 1 2 2008
a 2 1 2008
a 3 4 395


So we need to skip the header lines, starting either with c or p. Then, on each edge line, we have the letter a, the source node index, the target node index and edge travel time. The edge weight has an int type here and we do not know the time unit. We assume that it corresponds to hundreds of seconds. This also does not matter regarding the comparison of various shortest path implementations.

So once the file is loaded with pandas.read_csv, we perform a few transformation steps:

• set the weight column to float type, and convert the weight from hundreds of seconds to seconds
• remove parallel edges by keeping a single edge with the min weight
• remove loops, if there is any
• shift the source and target indices by -1 in order to be 0-based

The point of removing parallel arcs, is to get a simple directed graphs, that we can represent with an adjacency matrix in a sparse format. Loops could be described in an adjacency matrix but they would be completely useless regarding shortest paths, the edge weights being non-negative.

def read_travel_time_graph(file_path):

with open(file_path) as f:
for line in lines:
if line.startswith("p"):
edge_count = int(line.split(" ")[-1])
elif line.startswith("a"):
break

file_path,
sep=" ",
names=["a", "source", "target", "weight"],
usecols=["source", "target", "weight"],
)

assert len(df) == edge_count

# data preparation and assertions
assert not df.isna().any().any()  # no missing values
df.weight = df.weight.astype(float)  # convert to float type
df.weight *= 0.01  # convert to seconds
assert df.weight.min() >= 0.0  # make sure travel times are non-negative
df = (
df.groupby(["source", "target"], sort=False).min().reset_index()
)  # remove parallel edges and keep the one with shortest weight
df = df[df["source"] != df["target"]]  # remove loops
df[["source", "target"]] -= 1  # switch to 0-based indices

return df

We can try read_travel_time_graph on the NY network file:

file_path = travel_time_graph_file_paths["NY"]
edges_df.head(3)
source target weight
0 0 1 20.08
1 0 11 21.05
2 0 1362 60.70

## Create a function to load the node coordinates into a Pandas dataframe

The node coordinate files are similar to the edge files:

! head -n 10 {coordinates_file_paths["NY"]}
c 9th DIMACS Implementation Challenge: Shortest Paths
c http://www.dis.uniroma1.it/~challenge9
c TIGER/Line nodes coords for graph USA-road-d.NY
c
p aux sp co 264346
c graph contains 264346 nodes
c
v 1 -73530767 41085396
v 2 -73530538 41086098
v 3 -73519366 41048796


We also need to skip the header lines, starting either with c or p. Then, on each edge line, we have the letter v, the node index, the longitude and latitude. From what I understand, the coordinates are expressed in WGS 84, but as int type, in millionth of degrees. So we divide the int longitude and latitude numbers by a million to obtain the coordinates in degrees.

def read_coords(file_path, epsg=4326):

with open(file_path) as f:
for line in lines:
if line.startswith("p"):
vertex_count = int(line.split(" ")[-1])
elif line.startswith("v"):
break

file_path,
sep=" ",
names=["v", "id", "lng", "lat"],
usecols=["id", "lng", "lat"],
)

df["id"] -= 1  # 0-based indices
df[["lng", "lat"]] /= 10**6  # convert the coordinates to degrees

if epsg != 4326:

gpd.options.use_pygeos = True

# load the vertices into a geodataframe
gdf = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.lng, df.lat), crs="EPSG:4326"  # WGS 84
)
gdf.drop(["lng", "lat"], axis=1, inplace=True)
gdf.set_index("id", inplace=True)
nodes_gs = gdf.geometry
nodes_gs = nodes_gs.to_crs(epsg=epsg)
nodes_gdf = nodes_gs.to_frame("geometry")
nodes_gdf["x"] = nodes_gdf.geometry.x
nodes_gdf["y"] = nodes_gdf.geometry.y
nodes_df = nodes_gdf[["x", "y"]]

else:

nodes_df = df.rename(columns={"lng": "x", "lat": "y"})
nodes_df.set_index("id", inplace=True)

assert len(nodes_df) == vertex_count

return nodes_df

Let's run read_coords on the NY network file:

file_path = coordinates_file_paths["NY"]
nodes_df.head(3)
x y
id
0 -73.530767 41.085396
1 -73.530538 41.086098
2 -73.519366 41.048796

We added an argument epsg with a target CRS. This is, for the case where we would like to use a different CRS than EPSG:4326. GeoPandas is used to transform the data in that case.

nodes_df = read_coords(file_path, epsg=3857)
nodes_df.head(3)
x y
id
0 -8.185408e+06 5.024946e+06
1 -8.185382e+06 5.025049e+06
2 -8.184138e+06 5.019542e+06

## Save the networks into parquet files

Now we call both functions, read_travel_time_graph and read_coords, for all the networks and write the corresponding dataframes as parquet files.

parquet_tt_file_paths = {}
for name in names:
file_path = travel_time_graph_file_paths[name]
parquet_tt_file_path = file_path + ".parquet"
parquet_tt_file_paths[name] = parquet_tt_file_path
if not os.path.exists(parquet_tt_file_path):
edges_df.to_parquet(parquet_tt_file_path)
parquet_coord_file_paths = {}
for name in names:
file_path = coordinates_file_paths[name]
parquet_coord_file_path = file_path + ".parquet"
if not os.path.exists(parquet_coord_file_path):
nodes_df.to_parquet(parquet_coord_file_path)
parquet_coord_file_paths[name] = parquet_coord_file_path

We now have all the parquet files ready for use on the disk!

!tree -P '*.parquet' {data_dir_path}
/home/francois/Data/Disk_1/DIMACS_road_networks
├── BAY
├── CAL
├── COL
├── CTR
├── E
├── FLA
├── LKS
├── NE
├── NW
├── NY
├── USA
└── W

12 directories, 24 files


## Query the parquet files with DuckDB

Although we could have done it earlier when we had the dataframes in our hands, we are now going to perform some basic analysis of the networks. The motivation is to clearly separate the different steps.

We want to count the number of edges and vertices in each network, from the parquet files, without loading all the data into memory, and rather in an efficient way. We are going to compute these network features using some SQL, with DuckDB.

%%time
network_info = []
for name in names:
parquet_tt_file_path = parquet_tt_file_paths[name]
query = f"""SELECT COUNT(*), MAX(source), MAX(target) FROM parquet_scan('{parquet_tt_file_path}')"""
res = duckdb.query(query).fetchall()[0]
edge_count = res[0]
vertex_count = np.max(res[1:3]) + 1
query = f"""
WITH edges AS (SELECT
source,
target
FROM
parquet_scan('{parquet_tt_file_path}'))
SELECT
COUNT(DISTINCT node)
FROM
(     SELECT
source AS node
FROM
edges
UNION ALL
SELECT
target AS node
FROM
edges
)"""
used_vertices = duckdb.query(query).fetchone()[0]
mean_degree = edge_count / used_vertices
network_info.append(
{
"name": name,
"vertex_count": vertex_count,
"used_vertices": used_vertices,
"edge_count": edge_count,
"mean_degree": mean_degree,
}
)
CPU times: user 1min 16s, sys: 3.96 s, total: 1min 20s
Wall time: 43.6 s

network_info_df = pd.DataFrame(network_info).set_index("name")
network_info_df = network_info_df.sort_values(by=["edge_count"])
pd.set_option("display.precision", 2)
network_info_df
vertex_count used_vertices edge_count mean_degree
name
NY 264346 264346 730100 2.76
BAY 321270 321270 794830 2.47
COL 435666 435666 1042400 2.39
FLA 1070376 1070376 2687902 2.51
NW 1207945 1207945 2820774 2.34
NE 1524453 1524453 3868020 2.54
CAL 1890815 1890815 4630444 2.45
LKS 2758119 2758119 6794808 2.46
E 3598623 3598623 8708058 2.42
W 6262104 6262104 15119284 2.41
CTR 14081816 14081816 33866826 2.41
USA 23947347 23947347 57708624 2.41

We can observe that all the vertices are actually used in the graph. The mean degree does not vary too much, although it it a little larger in the NY area, which is densely populated.

## Plot the networks with Datashader

Finally, we are going to plot some of these networks using Datashader.

But before that, we also want to print a rough approximation of the network width and height. Because we have the coordinates expressed in WGS84, let's write a small function to compute an approximate distance in meters between two points, given their lat/lon coordinates. This is a straight implementation of Haversine formula.

def haversine_distance(lon_1, lat_1, lon_2, lat_2):
"""Calculate the distance between two points using their latitude and longitude."""

a = (
np.sin(0.5 * delta_phi) ** 2
+ np.cos(phi_1) * np.cos(phi_2) * np.sin(0.5 * delta_lambda) ** 2
)
c = 2.0 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

R = 6.371e6  # approximate earth radius
d_m = R * c

return d_m

Now the following load_and_plot function is actually loading the vertex and edge dataframes from the respective parquet files. Then both are used in Datashader's connect_edges function. The Datashader part is strongly inspired from its documentation about networks.

def load_and_plot(
name,
parquet_coord_file_paths,
parquet_tt_file_paths,
network_info_df,
plot_width=1200,
):

parquet_coord_file_path = parquet_coord_file_paths[name]
parquet_tt_file_path = parquet_tt_file_paths[name]
edges_df = edges_df.astype(np.uintc)

# compute the network width and height
xr = nodes_df.x.min(), nodes_df.x.max()
yr = nodes_df.y.min(), nodes_df.y.max()
width_km = 0.001 * haversine_distance(xr[0], yr[0], xr[1], yr[0])
height_km = 0.001 * haversine_distance(xr[0], yr[0], xr[0], yr[1])
print(f"Network : {name}")
print(
f"width : {int(round(width_km)):6d} km, height : {int(round(height_km)):6d} km"
)
vertex_count = network_info_df.loc[name, "vertex_count"]
edge_count = network_info_df.loc[name, "edge_count"]
print(f"vertex count : {vertex_count}, edge count : {edge_count}")

# plot the network
edges_ds = connect_edges(nodes_df, edges_df)
plot_height = int(plot_width * (yr[1] - yr[0]) / (xr[1] - xr[0]))
cvsopts = dict(plot_height=plot_height, plot_width=plot_width)
canvas = ds.Canvas(x_range=xr, y_range=yr, **cvsopts)
px=0,
)

return ep
load_and_plot("NY", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : NY
width :     85 km, height :    111 km
vertex count : 264346, edge count : 730100


load_and_plot("BAY", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : BAY
width :    178 km, height :    222 km
vertex count : 321270, edge count : 794830


load_and_plot("COL", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : COL
width :    621 km, height :    445 km
vertex count : 435666, edge count : 1042400


load_and_plot("FLA", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : FLA
width :    753 km, height :    680 km
vertex count : 1070376, edge count : 2687902


load_and_plot("NW", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : NW
width :    720 km, height :    779 km
vertex count : 1207945, edge count : 2820774


load_and_plot("NE", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : NE
width :    520 km, height :    389 km
vertex count : 1524453, edge count : 3868020


load_and_plot("CAL", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : CAL
width :    976 km, height :   1056 km
vertex count : 1890815, edge count : 4630444


load_and_plot("LKS", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : LKS
width :   1591 km, height :    827 km
vertex count : 2758119, edge count : 6794808


load_and_plot("E", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : E
width :   1116 km, height :   1543 km
vertex count : 3598623, edge count : 8708058


load_and_plot("W", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : W
width :   2422 km, height :   2331 km
vertex count : 6262104, edge count : 15119284


load_and_plot("CTR", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : CTR
width :   2114 km, height :   2669 km
vertex count : 14081816, edge count : 33866826


We note that it is hard to distinguish anything in the last plot because of the network large size, with a fixed size plot and a fixed edge width. Also I couldn't plot the largest one, USA, for some memory reasons.

The networks are now ready to be use by some Shortest Path algorithms, which will be the subject of some future posts.