<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Project Overview</title>

  <link rel="stylesheet" type="text/css" href="reset.css" />
  <link rel="stylesheet" type="text/css" href="common.css" />
</head>

<body>

<h1>Project Overview</h1>
<p id="project_title">
  Chlorine Transportation Optimization for the Chemical Security
  Analysis Center
</p>

<h2>General Goals</h2>
<p>
  The overall goal of the census-tools project is to provide the
  applied math lab (AML) with a set of libraries and utilities
  necessary to complete its assignment.
</p>

<p>
  One of the foremost goals that must be achieved is to model the
  average population density throughout the United States. Using
  this data, we would like to be able to calculate the risk
  associated with an <q>event</q> taking place somewhere in the
  United States. This will, in general, be an accident or other
  unexpected event that causes some damage to the surrounding
  population and environment.
</p>

<p>
  Our canonical example (for which the project was created) is that
  of a chlorine spill. If chlorine is spilled or otherwise released
  somewhere in the United States, we would like to be able to
  calculate the number of people affected.
</p>

<h2>Specific Goals</h2>
<ul>
  <li>
    Be able to determine the average population density at any point
    within the United States, given by GPS coordinates.
  </li>

  <li>
    Determine the total population within a subset of the U.S. This
    subset will be defined by a polygon or other two-dimensional
    (possibly piecewise-defined) function. (A sketch of such a query
    appears after this list.)
  </li>

  <li>
    Calculate the path that roads follow throughout the
    U.S. Ideally, the path would be calculated in terms of GPS
    coordinates within some <a
    href="http://en.wikipedia.org/wiki/SRID">spatial reference
    system</a>.
  </li>
</ul>
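
<p>
  As an illustration of the second goal, the following sketch sums
  the block populations that fall inside a polygon. The table and
  column names (<q>tiger_blocks</q>, <q>sf1_blocks</q>,
  <q>pop100</q>, <q>the_geom</q>) are hypothetical stand-ins for
  whatever the real schema defines, and the psycopg2 driver is
  assumed.
</p>

<pre>
import psycopg2

conn = psycopg2.connect(dbname="census")
cur = conn.cursor()

# A WKT polygon in NAD83 (SRID 4269); the coordinates are invented.
polygon = ("POLYGON((-76.7 39.2, -76.5 39.2, -76.5 39.4, "
           "-76.7 39.4, -76.7 39.2))")

# Sum the population of every block whose geometry intersects the
# polygon. Hypothetical table/column names throughout.
cur.execute("""
    SELECT SUM(sf1.pop100)
    FROM sf1_blocks sf1
    JOIN tiger_blocks tiger ON sf1.blkidfp00 = tiger.blkidfp00
    WHERE ST_Intersects(tiger.the_geom, ST_GeomFromText(%s, 4269))
""", (polygon,))

print(cur.fetchone()[0])
</pre>

<p>
  Note that this counts any block the polygon so much as touches; a
  more careful version would weight each block by the fraction of
  its area that falls inside the polygon.
</p>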

<h2>Software Requirements</h2>
<p>
  The following software is required to utilize census-tools in
  full. It should be possible to reuse the code independently of the
  other software, but many of the features such as database
  integration and build system automation require third-party
  programs.
</p>

<ol>
  <li>
    Most of the application code is written in <a
    href="http://www.python.org/">Python</a>, and so the Python
    runtime is required to run it.

    <ol>
      <li>
        We utilize a third-party library called <a
        href="http://pypi.python.org/pypi/Shapely">Shapely</a> for
        Python/GEOS integration. GEOS is required by PostGIS (see
        below), so it is not listed as a separate requirement, even
        though Shapely does depend on it. (A minimal Shapely sketch
        appears after this list.)
      </li>
    </ol>
  </li>

  <li>
    The build system utilizes <a
    href="http://www.gnu.org/software/make/">GNU Make</a>. The
    makefile simply automates some processes which can be run
    manually: executing tests, downloading data, importing
    shapefiles, etc. Other (non-GNU) versions of Make should also
    work.

    <ol>
      <li>
        The Make system assumes that the user has a POSIX-compatible
        shell. The author has tested the makefiles with <a
        href="http://en.wikipedia.org/wiki/Bash">Bash</a>, but other
        shells, or an environment such as <a
        href="http://en.wikipedia.org/wiki/Cygwin">Cygwin</a>, should
        suffice.
      </li>

      <li>
        <a href="http://en.wikipedia.org/wiki/Wget">wget</a> is used
        to download the Census shapefiles.
      </li>
    </ol>
  </li>

  <li>
    The underlying database management system is <a
    href="http://www.postgresql.org/">PostgreSQL</a>.
  </li>

  <li>
    The GIS capabilities are provided (to PostgreSQL) by <a
    href="http://postgis.refractions.net/">PostGIS</a>.
  </li>
</ol>
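
<p>
  For reference, a minimal sketch of the sort of geometry test that
  Shapely exposes to Python. The coordinates are invented.
</p>

<pre>
from shapely.geometry import Point, Polygon

# A toy block boundary and a candidate spill site.
block = Polygon([(-76.61, 39.28), (-76.60, 39.28),
                 (-76.60, 39.29), (-76.61, 39.29)])
spill = Point(-76.605, 39.285)

print(block.contains(spill))  # True
</pre>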

<h2>Census Databases</h2>
<p>
  There are two Census databases (obtained as flat files) which we
  require for our purposes. The first is <a
  href="http://www.census.gov/Press-Release/www/2001/sumfile1.html">Summary
  File 1</a>, which contains the block-level demographic
  information. The second database, <a
  href="http://www.census.gov/geo/www/tiger/">TIGER/Line</a>, gives
  us the geometry of those blocks.
</p>

<h3>Summary File 1</h3>

<p>
  There are a number of tables contained within the Summary File 1
  database. We require only one of these: the geographic header
  records. The Summary File 1 geographic header records provide,
  among other things, the following (a short density sketch follows
  the list):
</p>

<ul>
  <li>The GPS coordinates of each block's centroid.</li>
  <li>The total population contained within a block.</li>
  <li>The total land/water area of each block.</li>
  <li>
    Block/state/county/tract identifiers which combine to form a
    unique identifier.
  </li>
</ul>
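
<p>
  Average population density falls out of these fields directly:
  divide the total population by the land area. A sketch with
  invented numbers, assuming the area fields are given in square
  meters:
</p>

<pre>
# Average density of a single block from its header fields.
pop100   = 42         # total population of the block (invented)
arealand = 125000.0   # land area in square meters (invented)

density = pop100 / (arealand / 1.0e6)
print(density)        # persons per square kilometer
</pre>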


<h3>TIGER/Line</h3>

<p>
  The TIGER/Line information comes in the form of <a
  href="http://en.wikipedia.org/wiki/Shapefile">ESRI
  shapefiles</a>. These shapefiles contain the geometric information
  about the Census blocks.
</p>

<p>
  Each TIGER/Line record contains,
</p>

<ul>
  <li>
    The same block/state/county/tract information as the Summary
    File 1 geographic header records.
  </li>

  <li>
    A redundant field, called <q>blkidfp00</q>, which contains the
    concatenation of block/state/county/tract. This is our unique
    identifier (a sketch of its construction appears after this
    list).
  </li>

  <li>
    A geometry column containing a set of points, lines, and
    polygons that define the boundaries of the block.
  </li>
</ul>
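
<p>
  A sketch of how such an identifier is assembled. The component
  order and field widths used here (state 2, county 3, tract 6,
  block 4 digits) are our reading of the TIGER/Line documentation
  and should be treated as an assumption; the tract and block values
  are invented.
</p>

<pre>
def blkidfp00(state, county, tract, block):
    # Zero-pad each FIPS component to its assumed fixed width,
    # then concatenate.
    return (state.zfill(2) + county.zfill(3) +
            tract.zfill(6) + block.zfill(4))

# Baltimore County, Maryland, with an invented tract and block:
print(blkidfp00("24", "5", "40802", "1005"))  # "240050408021005"
</pre>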


<h3>Summary File 1 / TIGER Integration</h3>

<p>
  We need to correlate the TIGER/Line geometric information with the
  demographic information contained in the Summary File 1 geographic
  header records. To do this, we need to rely on the unique
  <q>blkidfp00</q> identifier.
</p>

<p>
  For the TIGER table, this field has already been calculated. We
  calculate it manually for the geographic header records. Then, we
  can define a one-to-one correspondence between geographic header
  records and TIGER/Line records. This satisfies one of our goals:
  correlating the two tables allows us to determine the population
  density at any point.
</p>
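
<p>
  A sketch of the query this correspondence enables: find the block
  containing a given point and report its density. The table and
  column names are hypothetical, as before.
</p>

<pre>
import psycopg2

conn = psycopg2.connect(dbname="census")
cur = conn.cursor()

# Density at one GPS coordinate: population divided by block area.
cur.execute("""
    SELECT sf1.pop100 / (sf1.arealand + sf1.areawatr)
    FROM sf1_blocks sf1
    JOIN tiger_blocks tiger ON sf1.blkidfp00 = tiger.blkidfp00
    WHERE ST_Contains(tiger.the_geom,
                      ST_SetSRID(ST_MakePoint(%s, %s), 4269))
""", (-76.61, 39.29))

row = cur.fetchone()
print(row[0] if row else "no block contains that point")
</pre>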


<h2>Importing the Data</h2>

<p>
  <em>
    Note: the <a href="../../makefile">makefile</a> provides a task
    for creation/import of the databases, but its use is not
    strictly required.
  </em>
</p>


<h3>Creation of the Database</h3>

<p>
  A Postgres/PostGIS database is required to store our Census
  data. The database name is unimportant (default: <q>census</q>),
  but several of the scripts refer to the table names. For
  simplicity, we will call the database <q>census</q> from now on.
</p>

<p>
  Once the database has been created, we need to load two PostGIS
  SQL scripts so that the database supports the GIS
  functionality. These two files are <q>lwpostgis.sql</q> (the
  PostGIS functions and types) and <q>spatial_ref_sys.sql</q> (the
  table of spatial reference systems). See the <a
  href="../../makefile">makefile</a> for an example of their import.
</p>
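
<p>
  A sketch of the creation steps that the makefile automates. The
  paths to the two PostGIS scripts vary by installation and are
  guesses here.
</p>

<pre>
import subprocess

# Installation-dependent paths; these are illustrative guesses.
postgis_sql = "/usr/share/postgresql/contrib/lwpostgis.sql"
srs_sql     = "/usr/share/postgresql/contrib/spatial_ref_sys.sql"

# Create the database, then load the PostGIS functions and the
# spatial reference systems into it.
subprocess.check_call(["createdb", "census"])
subprocess.check_call(["psql", "-d", "census", "-f", postgis_sql])
subprocess.check_call(["psql", "-d", "census", "-f", srs_sql])
</pre>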

<p>
  Before we import any data, we need to develop a database schema
  that models our Census data. The TIGER/Line schema is provided for
  us, so we have modeled the Summary File 1 schema in a similar
  fashion.
</p>

<p>
  The Summary File 1 table schema is <a
  href="../../sql/create-sf1_blocks-table.sql">defined in a standard
  SQL script</a>. The following steps will all require that a
  database exist and contain the table defined by this script.
</p>


<h3>Importing the Summary File 1 Block Records</h3>

<p>
  The Summary File 1 data are obtained as flat files. We therefore
  utilize a Python script, <a
  href="../../bin/sf1blocks2sql">sf1blocks2sql</a>, to parse and
  import the blocks. The source code to the script is documented,
  but basically, it performs three tasks: it parses each record from
  the geographic header record file, filters out any non-block
  records, and then writes out a SQL statement to insert that record
  into the Postgres/PostGIS database.
</p>
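
<p>
  A condensed sketch of those three tasks. The field offsets, the
  block-level summary code, and the file name are illustrative only;
  the real geographic header layout is documented in the script
  itself.
</p>

<pre>
import sys

def parse(line):
    # Illustrative fixed-width slices; not the real offsets.
    return {
        "sumlev":    line[8:11],    # summary level code
        "blkidfp00": line[11:26],   # concatenated identifier
        "pop100":    int(line[26:35]),
    }

for line in open("mdgeo.uf1"):
    rec = parse(line)
    if rec["sumlev"] != "101":   # keep only block-level records
        continue
    sys.stdout.write(
        "INSERT INTO sf1_blocks (blkidfp00, pop100) "
        "VALUES ('%s', %d);\n" % (rec["blkidfp00"], rec["pop100"]))
</pre>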


<h3>Importing the TIGER/Line Shapefiles</h3>

<p>
  Since the shapefiles are in a standard format, we can use
  pre-existing tools to import the data into our SQL
  database. PostGIS provides a binary, <q>shp2pgsql</q>, that will
  parse and convert the shapefiles to SQL.
</p>

<p>
  There is one caveat here: the <q>shp2pgsql</q> program requires
  an SRID as an argument; this SRID is assigned to each record it
  imports. We have designated an SRID of 4269, which denotes
  <q>NAD83</q>, or the North American Datum (1983). There may be
  some possibility of improvement here; since we are only
  considering the Mid-Atlantic, there may be coordinate systems
  (SRIDs) which are more accurate in our subset of the United
  States.
</p>
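
<p>
  A sketch of the import pipeline. The shapefile and table names are
  invented examples; <q>-s</q> assigns the SRID discussed above, and
  <q>-I</q> builds a spatial index.
</p>

<pre>
import subprocess

# Convert the shapefile to SQL and feed it straight to psql.
subprocess.check_call(
    "shp2pgsql -s 4269 -I tl_2008_24_tabblock00.shp tiger_blocks "
    "| psql -d census",
    shell=True)
</pre>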

<h2>Possible Optimizations</h2>
<p>
  There are a number of possible optimizations that can be made
  should performance ever become unacceptable. To date, these have
  been eschewed for lack of flexibility and/or development time.
</p>

<h3>De-normalization of TIGER/SF1 Block Data</h3>
<p>
  Currently, the TIGER/Line block data is stored in a separate table
  from the Summary File 1 block data. The two are combined at query
  time via <a href="http://en.wikipedia.org/wiki/Join_(SQL)">SQL
  JOIN</a>s. Since we import the TIGER data first, and use a custom
  import script for SF1, we could <a
  href="http://en.wikipedia.org/wiki/Denormalization">de-normalize</a>
  this design to increase query speed.
</p>

<p>
  This would slow down the SF1 import, of course; but the import
  only needs to be performed once. The procedure would look like the
  following (a SQL sketch appears after the list):
</p>

<ol>
  <li>
    Add the SF1 columns to the TIGER table, allowing them to be
    nullable initially (since they will all be NULL at first).
  </li>

  <li>
    Within the SF1 import, we would,
    <ol>
      <li>Parse a block</li>
      <li>
        Use that block's blkidfp00 to find the corresponding row in
        the TIGER table.
      </li>
      <li>
        Update the TIGER row with the values from SF1.
      </li>
    </ol>
  </li>

  <li>
    Optionally set the SF1 columns to NOT NULL. This may have
    <em>some</em> performance benefit, but I wouldn't count on it.
  </li>

  <li>
    Fix all the SQL queries to use the new schema.
  </li>
</ol>
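
<p>
  The sketch promised above, with hypothetical table and column
  names. The UPDATE would be issued once per block parsed by the SF1
  import.
</p>

<pre>
# Step 1: add the (initially nullable) SF1 columns. Run once.
ADD_COLUMNS = """
    ALTER TABLE tiger_blocks ADD COLUMN pop100 integer;
"""

# Step 2.3: copy one parsed block's values into its TIGER row.
UPDATE_ROW = """
    UPDATE tiger_blocks
    SET pop100 = %(pop100)s
    WHERE blkidfp00 = %(blkidfp00)s;
"""

# Step 3 (optional): forbid NULLs once the import has finished.
SET_NOT_NULL = """
    ALTER TABLE tiger_blocks ALTER COLUMN pop100 SET NOT NULL;
"""
</pre>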


<h3>Switch from GiST to GIN Indexes</h3>
<p>
  When the TIGER data is imported via <q>shp2pgsql</q>, a <a
  href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GiST</a>
  index is added to the geometry column by means of the
  <strong>-I</strong> flag. This improves the performance of the
  population calculations by a (wildly estimated) order of
  magnitude.
</p>
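
<p>
  For comparison, the index that <strong>-I</strong> creates can
  also be built by hand. A sketch, kept in a Python string to match
  the other examples; the table and column names are hypothetical.
</p>

<pre>
# The GiST index that shp2pgsql's -I flag creates, spelled out.
CREATE_GIST = """
    CREATE INDEX tiger_blocks_geom_idx
    ON tiger_blocks
    USING GIST (the_geom);
"""
</pre>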

<p>
  Postgres, however, offers another type of similar index &mdash;
  the <a
  href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GIN
  Index</a>. If performance degrades beyond what is acceptable, it
  may be worth evaluating the benefit of a GIN index versus the GiST
  one.
</p>

</body>
</html>