Project Overview

Chlorine Transportation Optimization for the Chemical Security Analysis Center

General Goals

The overall goal of the census-tools project is to provide the Applied Math Lab (AML) with a set of libraries and utilities necessary to complete its assignment.

One of the foremost goals that must be achieved is to model the average population density throughout the United States. Using this data, we would like to be able to calculate the risk associated with an event taking place somewhere in the United States. This will, in general, be an accident or other unexpected event that causes some damage to the surrounding population and environment.

Our canonical example (for which the project was created) is that of a chlorine spill. If chlorine is spilled or otherwise released somewhere in the United States, we would like to be able to calculate the number of people affected.

Specific Goals

Software Requirements

The following software is required to utilize census-tools in full. It should be possible to reuse the code independently of the other software, but many of the features such as database integration and build system automation require third-party programs.

  1. Most of the application code is written in Python, and so the Python runtime is required to run it.
    1. We utilize a third-party library called Shapely for Python/GEOS integration (a short usage sketch follows this list). GEOS is required by PostGIS (see below), so it is not listed as a separate requirement, even though Shapely does depend on it.
  2. The build system utilizes GNU Make. The makefile simply automates some processes which can be run manually: executing tests, downloading data, importing shapefiles, etc. Other (non-GNU) versions of Make should also work.
    1. The Make system assumes that the user has a POSIX-compatible shell. The author has tested the makefiles with Bash, but other POSIX shells (or environments such as Cygwin) should suffice.
    2. wget is used to download the Census shapefiles.
  3. The underlying database management system is PostgreSQL.
  4. The GIS capabilities are provided (to PostgreSQL) by PostGIS.
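
As a small illustration of the Python/GEOS integration that Shapely provides, the sketch below builds a polygon (standing in for a block boundary) and tests whether a point falls inside it. The coordinates are made up for illustration; this is not code from the project itself.

  # Illustrative only: a made-up polygon standing in for a Census block
  # boundary, and a point standing in for an incident location.
  from shapely.geometry import Point, Polygon

  block = Polygon([(-77.05, 38.88), (-77.00, 38.88),
                   (-77.00, 38.92), (-77.05, 38.92)])
  incident = Point(-77.02, 38.90)

  print(block.contains(incident))  # True if the point lies inside the polygon
  print(block.area)                # area in the polygon's coordinate units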

Census Databases

There are two Census databases (obtained as flat files) which we require for our purposes. The first is Summary File 1, which contains the block-level demographic information. The second database, TIGER/Line, gives us the geometry of those blocks.

Summary File 1

There are a number of tables contained within the Summary File 1 database. We require only one of these: the geographic header records. Among other things, the Summary File 1 geographic header records provide the block-level population counts and the geographic identifiers from which the blkidfp00 key (see Summary File 1 / TIGER Integration, below) is constructed.

TIGER/Line

The TIGER/Line information comes in the form of ESRI shapefiles. These shapefiles contain the geometric information about the Census blocks.

Each TIGER/Line record contains the geometry of a single Census block, along with identifying information such as the blkidfp00 field discussed below.

Summary File 1 / TIGER Integration

We need to correlate the TIGER/Line geometric information with the demographic information contained in the Summary File 1 geographic header records. To do this, we need to rely on the unique blkidfp00 identifier.

For the TIGER table, this field has already been calculated for us; for the geographic header records, we calculate it ourselves. We can then define a one-to-one correspondence between geographic header records and TIGER/Line records. This satisfies one of our goals: correlating the two tables allows us to determine the population density at any point.
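
As a sketch of how that correspondence might be used, the following query joins the two tables on blkidfp00 and returns the population of the block containing a given point. The table names (sf1_blocks, tiger_blocks) and column names (pop100, the_geom) are assumptions rather than the project's actual schema, and psycopg2 is used here purely for illustration.

  # Sketch only: table and column names are assumptions; adjust to the
  # actual schema. Requires the PostGIS functions loaded into the database.
  import psycopg2

  conn = psycopg2.connect(dbname="census")
  cur = conn.cursor()

  query = """
      SELECT sf1.pop100
      FROM sf1_blocks sf1
      JOIN tiger_blocks tiger
        ON sf1.blkidfp00 = tiger.blkidfp00
      WHERE ST_Contains(tiger.the_geom,
                        ST_SetSRID(ST_MakePoint(%s, %s), 4269));
  """

  cur.execute(query, (-76.61, 39.29))  # longitude, latitude of the point
  row = cur.fetchone()
  print(row[0] if row else "point is not inside any block")

Dividing the returned count by the block's area would then give the population density at that point.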

Importing the Data

Note: the makefile provides targets for creating the database and importing the data, but their use is not strictly required.

Creation of the Database

A Postgres/PostGIS database is required to store our Census data. The database name is unimportant (default: census), but several of the scripts refer to the table names. For simplicity, we will call the database census from now on.

Once the database has been created, we need to load two PostGIS SQL scripts so that we can support the GIS functionality. These two files are lwpostgis.sql and spatial_ref_sys.sql. See the makefile for an example of their import.
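
For reference, a minimal sketch of that step is shown below, assuming the scripts live in a typical PostgreSQL contrib directory; the actual paths vary by installation, and the makefile remains the authoritative example.

  # Sketch: load the PostGIS setup scripts into the "census" database with
  # psql. The script paths below are assumptions.
  import subprocess

  for script in ("/usr/share/postgresql/contrib/lwpostgis.sql",
                 "/usr/share/postgresql/contrib/spatial_ref_sys.sql"):
      subprocess.check_call(["psql", "-d", "census", "-f", script])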

Before we import any data, we need to develop a database schema that models our Census data. The TIGER/Line schema is provided for us, so we have modeled the Summary File 1 schema in a similar fashion.

The Summary File 1 table schema is defined in a standard SQL script. The following steps all require that the database exists and contains the table defined by this script.

Importing the Summary File 1 Block Records

The Summary File 1 data are obtained as flat files. We therefore utilize a Python script to parse and import the block records. The source code of the script is documented, but basically, it performs three tasks: it parses each record from the geographic header record file, filters out any non-block records, and then writes out an SQL statement to insert that record into the Postgres/PostGIS database.
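
The sketch below shows the general shape of such a script. The summary-level code, the field offsets, and the table and column names are hypothetical placeholders, not the real Summary File 1 record layout.

  # Sketch of the three tasks described above: parse each geographic header
  # record, keep only block-level records, and emit an SQL INSERT for each.
  # Field slices and the block summary-level code are placeholders.
  import sys

  BLOCK_SUMMARY_LEVEL = "101"   # assumed code for block-level records

  def parse_record(line):
      """Pull the fields we care about out of one fixed-width record."""
      return {
          "sumlev":    line[8:11],      # placeholder offsets
          "blkidfp00": line[11:26],     # placeholder offsets
          "pop100":    int(line[26:35]) # placeholder offsets
      }

  def main(path):
      with open(path) as geofile:
          for line in geofile:
              rec = parse_record(line)
              if rec["sumlev"] != BLOCK_SUMMARY_LEVEL:
                  continue   # skip non-block records
              print("INSERT INTO sf1_blocks (blkidfp00, pop100) "
                    "VALUES ('%s', %d);" % (rec["blkidfp00"], rec["pop100"]))

  if __name__ == "__main__":
      main(sys.argv[1])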

Importing the TIGER/Line Shapefiles

Since the shapefiles are in a standard format, we can use pre-existing tools to import the data into our SQL database. PostGIS provides a binary, shp2pgsql, that will parse and convert the shapefiles to SQL.

There is one caveat here: the shp2pgsql program requires an SRID as an argument; this SRID is assigned to each record it imports. We have designated an SRID of 4269, which denotes NAD83, or the North American Datum (1983). There may be some possibility of improvement here; since we are only considering the Mid-Atlantic, there may be coordinate systems (SRIDs) which are more accurate in our subset of the United States.
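
For illustration, here is a hedged sketch of that step written as a small Python wrapper around shp2pgsql and psql; the shapefile and table names are placeholders.

  # Sketch: convert a TIGER/Line shapefile to SQL with shp2pgsql and pipe
  # the result into psql. The -s flag sets the SRID (4269 = NAD83) and -I
  # requests a spatial index on the geometry column. File and table names
  # are placeholders.
  import subprocess

  shp2pgsql = subprocess.Popen(
      ["shp2pgsql", "-I", "-s", "4269",
       "tl_2008_24_tabblock00.shp", "tiger_blocks"],
      stdout=subprocess.PIPE)
  psql = subprocess.Popen(["psql", "-d", "census"], stdin=shp2pgsql.stdout)
  shp2pgsql.stdout.close()   # let shp2pgsql see SIGPIPE if psql exits early
  psql.communicate()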

Possible Optimizations

There are a number of possible optimizations that can be made should performance ever become prohibitive. To date, these have been eschewed for lack of flexibility and/or development time.

De-normalization of TIGER/SF1 Block Data

Currently, the TIGER/Line block data is stored in a separate table from the Summary File 1 block data. The two are combined at query time via SQL JOINs. Since we import the TIGER data first, and use a custom import script for SF1, we could de-normalize this design to increase query speed.

This would slow down the SF1 import, of course, but the import only needs to be performed once. The procedure would look like the following (a rough sketch in Python follows the list):

  1. Add the SF1 columns to the TIGER table, allowing them to be nullable initially (since they will all be NULL at first).
  2. Within the SF1 import, we would,
    1. Parse a block
    2. Use that block's blkidfp00 to find the corresponding row in the TIGER table.
    3. Update the TIGER row with the values from SF1.
  3. Optionally set the SF1 columns to NOT NULL. This may have some performance benefit, but I wouldn't count on it.
  4. Fix all the SQL queries to use the new schema.
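
A rough sketch of steps 1 and 2 follows, with a single assumed column (pop100) standing in for the full set of SF1 columns; the table name and the use of psycopg2 are likewise assumptions.

  # Sketch of the de-normalized import: add an (assumed) SF1 column to the
  # TIGER table, then update the matching TIGER row for each parsed SF1
  # block instead of inserting into a separate table.
  import psycopg2

  conn = psycopg2.connect(dbname="census")
  cur = conn.cursor()

  # Step 1: nullable SF1 column(s) on the TIGER table.
  cur.execute("ALTER TABLE tiger_blocks ADD COLUMN pop100 integer;")

  # Step 2: update the corresponding TIGER row for each parsed block.
  def import_block(blkidfp00, pop100):
      cur.execute("UPDATE tiger_blocks SET pop100 = %s WHERE blkidfp00 = %s;",
                  (pop100, blkidfp00))

  import_block("240054011002000", 57)   # made-up example values
  conn.commit()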

Switch from GiST to GIN Indexes

When the TIGER data is imported via shp2pgsql, a GiST index is added to the geometry column by means of the -I flag. This improves the performance of the population calculations by a (wildly-estimated) order of magnitude.

Postgres, however, offers another, similar type of index: the GIN index. If performance degrades beyond what is acceptable, it may be worth evaluating the benefit of a GIN index versus the GiST one.

Related Projects

Transportation Routing Analysis Geographic Information System (TRAGIS)

TRAGIS is developed by the Geographic Information Science & Technology Group at Oak Ridge National Laboratory. From the abstract:

The Transportation Routing Analysis Geographic Information System (TRAGIS) model is used to calculate highway, rail, or waterway routes within the United States. TRAGIS is a client-server application with the user interface and map data files residing on the user’s personal computer and the routing engine and network data files on a network server. The user’s manual provides documentation on installation and the use of the many features of the model.

It is of interest to us because it provides routing subject to HAZMAT and many other restrictions. Essentially, one provides TRAGIS with a starting point A and an end point B; it then calculates the best route between A and B, subject to a number of constraints.

Unfortunately, TRAGIS is a client/server application. Most of the interesting logic and data are stored on the server, to which we don't have access. The User Manual does provide some clues as to where they have obtained their data, though.