<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Project Overview</title>

    <link rel="stylesheet" type="text/css" href="reset.css" />
    <link rel="stylesheet" type="text/css" href="common.css" />
  </head>

  <body>

    <h1>Project Overview</h1>
    <p id="project_title">
      Chlorine Transportation Optimization for the Chemical Security
      Analysis Center
    </p>

    <h2>General Goals</h2>
    <p>
      The overall goal of the census-tools project is to provide the
      applied math lab (AML) with a set of libraries and utilities
      necessary to complete its assignment.
    </p>

    <p>
      One of the foremost goals that must be achieved is to model the
      average population density throughout the United States. Using
      this data, we would like to be able to calculate the risk
      associated with an <q>event</q> taking place somewhere in the
      United States. This will, in general, be an accident or other
      unexpected event that causes some damage to the surrounding
      population and environment.
    </p>

    <p>
      Our canonical example (for which the project was created) is that
      of a chlorine spill. If chlorine is spilled or otherwise released
      somewhere in the United States, we would like to be able to
      calculate the number of people affected.
    </p>

    <h2>Specific Goals</h2>
    <ul>
      <li>
        Be able to determine the average population density at any point
        within the United States, given by GPS coordinates.
      </li>

      <li>
        Determine the total population within a subset of the U.S. This
        subset will be defined by a polygon or some other
        two-dimensional region, possibly with a piecewise-defined
        boundary.
      </li>

      <li>
        Calculate the path that roads follow throughout the
        U.S. Ideally, the path would be calculated in terms of GPS
        coordinates within some <a
        href="http://en.wikipedia.org/wiki/SRID">spatial reference
        system</a>.
      </li>
    </ul>

    <h2>Software Requirements</h2>
    <p>
      The following software is required to utilize census-tools in
      full. It should be possible to reuse the code independently of the
      other software, but many of the features such as database
      integration and build system automation require third-party
      programs.
    </p>

    <ol>
      <li>
        The build system utilizes <a
        href="http://www.gnu.org/software/make/">GNU Make</a>. The
        makefile simply automates some processes which can be run
        manually: executing tests, downloading data, importing
        shapefiles, etc. Other (non-GNU) versions of Make should also
        work.

        <ol>
          <li>
            The Make system assumes that the user has a POSIX-compatible
            shell. The author has tested the makefiles with <a
            href="http://en.wikipedia.org/wiki/Bash">Bash</a>, but other
            POSIX shells (or environments such as <a
            href="http://en.wikipedia.org/wiki/Cygwin">Cygwin</a>) should
            suffice.
          </li>

          <li>
            <a href="http://en.wikipedia.org/wiki/Wget">wget</a> is used
            to download the Census shapefiles.
          </li>
        </ol>
      </li>

      <li>
        The underlying database management system is <a
        href="http://www.postgresql.org/">PostgreSQL</a>.
      </li>

      <li>
        The GIS capabilities are provided (to PostgreSQL) by <a
        href="http://postgis.refractions.net/">PostGIS</a>.
      </li>
    </ol>

    <h2>Census Databases</h2>
    <p>
      There are two Census databases (obtained as flat files) which we
      require for our purposes. The first is <a
      href="http://www.census.gov/Press-Release/www/2001/sumfile1.html">Summary
      File 1</a>, which contains the block-level demographic
      information. The second database, <a
      href="http://www.census.gov/geo/www/tiger/">TIGER/Line</a>, gives
      us the geometry of those blocks.
    </p>

    <h3>Summary File 1</h3>

    <p>
      There are a number of tables contained within the Summary File 1
      database. We require only one of these: the geographic header
      records. The Summary File 1 geographic header records provide,
      among other things, the following:
    </p>

    <ul>
      <li>The GPS coordinates of each block's centroid.</li>
      <li>The total population contained within a block.</li>
      <li>The total land/water area of each block.</li>
      <li>
        Block/state/county/tract identifiers which combine to form a
        unique identifier.
      </li>
    </ul>


    <h3>TIGER/Line</h3>

    <p>
      The TIGER/Line information comes in the form of <a
      href="http://en.wikipedia.org/wiki/Shapefile">ESRI
      shapefiles</a>. These shapefiles contain the geometric information
      about the Census blocks.
    </p>

    <p>
      Each TIGER/Line record contains:
    </p>

    <ul>
      <li>
        The same block/state/county/tract information as the Summary
        File 1 geographic header records.
      </li>

      <li>
        A redundant field, called <q>blkidfp00</q>, which contains the
        concatenation of block/state/county/tract. This is our unique
        identifier.
      </li>

      <li>
        A geometry column containing a set of points, lines, and
        polygons that define the boundaries of the block.
      </li>
    </ul>


    <h3>Summary File 1 / TIGER Integration</h3>

    <p>
      We need to correlate the TIGER/Line geometric information with the
      demographic information contained in the Summary File 1 geographic
      header records. To do this, we need to rely on the unique
      <q>blkidfp00</q> identifier.
    </p>

    <p>
      For the TIGER table, this field has already been calculated. We
      calculate it manually for the geographic header records. Then, we
      can define a one-to-one correspondence between geographic header
      records and TIGER/Line records. This satisfies one of our goals:
      correlating the two tables allows us to determine the population
      density at any point.
    </p>
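
    <p>
      As an illustration (the table and column names here are
      assumptions, not necessarily the ones our scripts create), the
      correlation might be expressed as a join on that identifier:
    </p>

    <pre>
-- A sketch of the SF1/TIGER correlation, assuming tables named
-- "sf1_blocks" and "tiger_blocks", a population count column
-- "pop100", and a PostGIS geometry column "the_geom".
SELECT sf1.pop100,
       tiger.the_geom
  FROM sf1_blocks sf1
  JOIN tiger_blocks tiger
    ON tiger.blkidfp00 = sf1.blkidfp00;
</pre>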


    <h2>Importing the Data</h2>

    <p>
      <em>
        Note: the <a href="../../makefile">makefile</a> provides a task
        for creation/import of the databases, but its use is not
        strictly required.
      </em>
    </p>


    <h3>Creation of the Database</h3>

    <p>
      A Postgres/PostGIS database is required to store our Census
      data. The database name is unimportant (default: <q>census</q>),
      but several of the scripts refer to the table names. For
      simplicity, we will call the database <q>census</q> from now on.
    </p>

    <p>
      Once the database has been created, we need to load two PostGIS
      SQL scripts so that we can support the GIS functionality. These
      two files are <q>lwpostgis.sql</q> and
      <q>spatial_ref_sys.sql</q>. See the <a
      href="../../makefile">makefile</a> for an example of their import.
    </p>
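
    <p>
      As a sketch (the script locations vary by installation, and the
      makefile may do this differently), the whole setup can be
      performed from within <q>psql</q>:
    </p>

    <pre>
-- Create the database, connect to it, and load the two PostGIS
-- scripts; the paths to them depend on where PostGIS is installed.
CREATE DATABASE census;
\c census
\i lwpostgis.sql
\i spatial_ref_sys.sql
</pre>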

    <p>
      Before we import any data, we need to develop a database schema
      that models our Census data. The TIGER/Line schema is provided for
      us, so we have modeled the Summary File 1 schema in a similar
      fashion.
    </p>

    <p>
      The Summary File 1 table schema is <a
      href="../../sql/create-sf1_blocks-table.sql">defined in a standard
      SQL script</a>. The following steps will all require that a
      database exist and contain the table defined by this script.
    </p>
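
    <p>
      The script itself is authoritative, but based on the geographic
      header fields described above, the table looks roughly like the
      following (the column names here are illustrative, not a copy of
      the real schema):
    </p>

    <pre>
-- An approximation of the Summary File 1 blocks table; see
-- sql/create-sf1_blocks-table.sql for the real definition.
CREATE TABLE sf1_blocks (
  blkidfp00 varchar(16) NOT NULL, -- unique block identifier
  pop100    integer,              -- total population within the block
  arealand  double precision,     -- land area
  areawatr  double precision,     -- water area
  intptlat  varchar(9),           -- latitude of the block's centroid
  intptlon  varchar(10)           -- longitude of the block's centroid
);
</pre>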


    <h3>Importing the Summary File 1 Block Records</h3>

    <p>
      The Summary File 1 data are obtained as flat files. We therefore
      utilize a Python script, <a
      href="../../bin/sf1blocks2sql">sf1blocks2sql</a>, to parse and
      import the blocks. The source code of the script is documented,
      but basically, it performs three tasks: it parses each record from
      the geographic header record file, filters out any non-block
      records, and then writes out a SQL statement to insert that record
      into the Postgres/PostGIS database.
    </p>
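
    <p>
      In other words, for each block, the script emits a statement of
      roughly the following shape (the values are invented, and the
      column list is an assumption):
    </p>

    <pre>
-- The flavor of SQL that sf1blocks2sql writes out: one INSERT per
-- block record in the geographic header file.
INSERT INTO sf1_blocks (blkidfp00, pop100, arealand, areawatr)
  VALUES ('240054011001000', 42, 123456.0, 0.0);
</pre>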


    <h3>Importing the TIGER/Line Shapefiles</h3>

    <p>
      Since the shapefiles are in a standard format, we can use
      pre-existing tools to import the data into our SQL
      database. PostGIS provides a binary, <q>shp2pgsql</q>, that will
      parse and convert the shapefiles to SQL.
    </p>
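
    <p>
      The output of <q>shp2pgsql</q> is ordinary SQL. A simplified,
      hypothetical excerpt (the real output differs in detail; the table
      and column names are assumptions) looks something like:
    </p>

    <pre>
-- Roughly what shp2pgsql emits: create the table, register its
-- geometry column with PostGIS, and insert each shape.
CREATE TABLE tiger_blocks (gid serial PRIMARY KEY,
                           blkidfp00 varchar(16));
SELECT AddGeometryColumn('', 'tiger_blocks', 'the_geom',
                         4269, 'MULTIPOLYGON', 2);
INSERT INTO tiger_blocks (blkidfp00, the_geom)
  VALUES ('240054011001000',
          GeomFromText('MULTIPOLYGON(((-76.5 39.2, -76.4 39.2,
                                        -76.4 39.3, -76.5 39.2)))',
                       4269));
</pre>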

    <p>
      There is one caveat here: the <q>shp2pgsql</q> program requires
      an SRID as an argument; this SRID is assigned to each record it
      imports. We have designated an SRID of 4269, which denotes
      <q>NAD83</q>, the North American Datum (1983). There may be
      some room for improvement here: since we are only considering
      the Mid-Atlantic, there may be coordinate systems (SRIDs) which
      are more accurate for our subset of the United States.
    </p>
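
    <p>
      The candidate SRIDs (including 4269) can be examined in the
      <q>spatial_ref_sys</q> table that was loaded during the database
      creation:
    </p>

    <pre>
-- Look up NAD83 in the table of spatial reference systems.
SELECT srid, srtext
  FROM spatial_ref_sys
 WHERE srid = 4269;
</pre>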

    <h2>Possible Optimizations</h2>
    <p>
      There are a number of possible optimizations that could be made
      should performance ever become unacceptable. To date, these have
      been eschewed for lack of flexibility and/or development time.
    </p>

    <h3>De-normalization of TIGER/SF1 Block Data</h3>
    <p>
      Currently, the TIGER/Line block data is stored in a separate table
      from the Summary File 1 block data. The two are combined at query
      time via <a href="http://en.wikipedia.org/wiki/Join_(SQL)">SQL
      JOIN</a>s. Since we import the TIGER data first, and use a custom
      import script for SF1, we could <a
      href="http://en.wikipedia.org/wiki/Denormalization">de-normalize</a>
      this design to increase query speed.
    </p>

    <p>
      This would slow down the SF1 import, of course, but the import
      only needs to be performed once. The procedure would look like the
      following (a SQL sketch appears after the list):
    </p>

    <ol>
      <li>
        Add the SF1 columns to the TIGER table, allowing them to be
        nullable initially (since they will all be NULL at first).
      </li>

      <li>
        Within the SF1 import, we would:
        <ol>
          <li>Parse a block.</li>
          <li>
            Use that block's <q>blkidfp00</q> to find the corresponding
            row in the TIGER table.
          </li>
          <li>
            Update the TIGER row with the values from SF1.
          </li>
        </ol>
      </li>

      <li>
        Optionally set the SF1 columns to NOT NULL. This may have
        <em>some</em> performance benefit, but I wouldn't count on it.
      </li>

      <li>
        Fix all the SQL queries to use the new schema.
      </li>
    </ol>
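
    <p>
      A sketch of the first and third steps (the table and column names
      are again assumptions; the per-block updates are shown here as a
      single set-based statement for brevity):
    </p>

    <pre>
-- Step 1: add an (initially nullable) SF1 column to the TIGER table.
ALTER TABLE tiger_blocks ADD COLUMN pop100 integer;

-- Step 2, condensed: copy the SF1 values into the TIGER rows,
-- matching on the unique blkidfp00 identifier.
UPDATE tiger_blocks
   SET pop100 = sf1_blocks.pop100
  FROM sf1_blocks
 WHERE sf1_blocks.blkidfp00 = tiger_blocks.blkidfp00;

-- Step 3 (optional): forbid NULLs once every row has been updated.
ALTER TABLE tiger_blocks ALTER COLUMN pop100 SET NOT NULL;
</pre>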


    <h3>Switch from GiST to GIN Indexes</h3>
    <p>
      When the TIGER data is imported via <q>shp2pgsql</q>, a <a
      href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GiST</a>
      index is added to the geometry column by means of the
      <strong>-I</strong> flag. This improves the performance of the
      population calculations by a (wildly-estimated) order of
      magnitude.
    </p>

    <p>
      Postgres, however, offers another, similar type of index: the <a
      href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GIN
      index</a>. If performance degrades beyond what is acceptable, it
      may be worth evaluating the benefit of a GIN index versus the GiST
      one.
    </p>
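
    <p>
      For reference, recreating the GiST index by hand would look like
      the following (the table and column names are assumptions; the
      <strong>-I</strong> flag produces the equivalent at import
      time). The GIN variant would substitute <q>USING gin</q>, provided
      an operator class supporting the geometry type is available.
    </p>

    <pre>
-- Recreate, by hand, the sort of spatial index that shp2pgsql's
-- -I flag creates at import time.
CREATE INDEX tiger_blocks_the_geom_idx
  ON tiger_blocks
  USING gist (the_geom);
</pre>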

  </body>
</html>