X-Git-Url: http://gitweb.michael.orlitzky.com/?p=dead%2Fcensus-tools.git;a=blobdiff_plain;f=doc%2Fproject_overview%2Findex.xhtml;h=d1544b603165acd09b99a1566422204e3c565af7;hp=051e56782b7203b6967033559ea4ec44a477c8b9;hb=4be084ca20f114b7a5de282f1ce489b2a65ae311;hpb=935a6ead0912829a7e0f153aa7aac7494977e69c diff --git a/doc/project_overview/index.xhtml b/doc/project_overview/index.xhtml index 051e567..d1544b6 100644 --- a/doc/project_overview/index.xhtml +++ b/doc/project_overview/index.xhtml @@ -30,7 +30,7 @@ One of the foremost goals that must be achieved is to model the average population density throughout the United States. Using this data, we would like to be able to calculate the risk - associated with an event taking place somewhere in the + associated with an event taking place somewhere in the United States. This will, in general, be an accident or other unexpected event that causes some damage to the surrounding population and environment. @@ -75,6 +75,22 @@

    +
  1. + Most of the application code is written in Python, and so the Python + runtime is required to run it. + +
      +
    1. + We utilize a third-party library called Shapely for + Python/GEOS integration. GEOS is required by PostGIS (see + below), so it is not listed as a separate requirement, even + though Shapely does depend on it. +
    2. +
    +
  2. +
  3. The build system utilizes GNU Make. The @@ -162,7 +178,7 @@
  4. - A redundant field, called blkidfp00, which contains the + A redundant field, called blkidfp00, which contains the concatenation of block/state/county/tract. This is our unique identifier.
  5. @@ -180,7 +196,7 @@ We need to correlate the TIGER/Line geometric information with the demographic information contained in the Summary File 1 geographic header records. To do this, we need to rely on the unique - blkidfp00 identifier. + blkidfp00 identifier.

    @@ -198,8 +214,8 @@

    Note: the makefile provides a task - for creation/import of the databases, but its use is strictly - required. + for creation/import of the databases, but its use is not + strictly required.

    @@ -208,16 +224,16 @@

    A Postgres/PostGIS database is required to store our Census - data. The database name is unimportant (default: census), + data. The database name is unimportant (default: census), but several of the scripts refer to the table names. For - simplicity, we will call the database census from now on. + simplicity, we will call the database census from now on.

    Once the database has been created, we need to import two PostGIS tables so that we can support the GIS functionality. These two - files are lwpostgis.sql and - spatial_ref_sys.sql. See the lwpostgis.sql and + spatial_ref_sys.sql. See the makefile for an example of their import.

    @@ -254,12 +270,12 @@

    Since the shapefiles are in a standard format, we can use pre-existing tools to import the data in to our SQL - database. PostGIS provides a binary, shp2pgsql, that will + database. PostGIS provides a binary, shp2pgsql, that will parse and convert the shapefiles to SQL.

    - There is one caveat here: the shp2pgsql program requires + There is one caveat here: the shp2pgsql program requires an SRID as an argument; this SRID is assigned to each record it imports. We have designated an SRID of 4269, which denotes NAD83, or the North American Datum (1983). There may be @@ -269,5 +285,79 @@ States.
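A minimal sketch of how the import command might be assembled, assuming hypothetical shapefile and table names (`blocks.shp`, `tiger_blocks`). The `-s` (SRID) and `-I` (spatial index) options are standard `shp2pgsql` flags; the makefile remains the authoritative example of the actual invocation.

```python
import shlex


def shp2pgsql_command(shapefile, table, srid=4269, create_index=True):
    """Build the argument list for a shp2pgsql invocation.

    The shapefile and table names are supplied by the caller; the ones
    used in the demo below are hypothetical. The default SRID of 4269
    denotes NAD83.
    """
    argv = ["shp2pgsql", "-s", str(srid)]
    if create_index:
        # -I asks shp2pgsql to emit a GiST index on the geometry column.
        argv.append("-I")
    argv += [shapefile, table]
    return argv


if __name__ == "__main__":
    cmd = shp2pgsql_command("blocks.shp", "tiger_blocks")
    # shp2pgsql writes SQL to stdout, so it is typically piped into psql.
    print(shlex.join(cmd) + " | psql -d census")
```

Building the arguments as a list, rather than one string, keeps the SRID explicit and avoids shell quoting problems if a path contains spaces.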

    +

    Possible Optimizations

    +

    + There are a number of possible optimizations that can be made + should performance ever become prohibitive. To date, these have + been avoided because they would reduce flexibility and/or for + lack of development time. +

    + +

    De-normalization of TIGER/SF1 Block Data

    +

    + Currently, the TIGER/Line block data is stored in a separate table + from the Summary File 1 block data. The two are combined at query + time via SQL + JOINs. Since we import the TIGER data first, and use a custom + import script for SF1, we could de-normalize + this design to increase query speed. +

    + +

    + This would slow down the SF1 import, of course; but the import + only needs to be performed once. The procedure would look like the + following: +

    + +
      +
    1. + Add the SF1 columns to the TIGER table, allowing them to be + nullable initially (since they will all be NULL at first). +
    2. + +
    3. + Within the SF1 import, we would: +
        +
      1. Parse a block
      2. +
      3. + Use that block's blkidfp00 to find the corresponding row in + the TIGER table. +
      4. +
      5. + Update the TIGER row with the values from SF1. +
      6. +
      +
    4. + +
    5. + Optionally set the SF1 columns to NOT NULL. This may have + some performance benefit, but I wouldn't count on it. +
    6. + +
    7. + Fix all the SQL queries to use the new, de-normalized schema. +
    8. +
    + + +
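The procedure above can be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database as a stand-in for Postgres so that it runs on its own, and the table name `tiger_blocks`, the column `pop100`, and the sample identifiers are assumptions made for the example, not names taken from the project's schema.

```python
import sqlite3


def denormalize(conn, sf1_blocks):
    """Fold SF1 values into the TIGER table.

    sf1_blocks is an iterable of (blkidfp00, pop100) pairs, standing in
    for the blocks parsed by the SF1 import script.
    """
    # Step 1: add the SF1 column; it is nullable, so every row starts NULL.
    conn.execute("ALTER TABLE tiger_blocks ADD COLUMN pop100 INTEGER")

    # Step 2: for each parsed block, find the matching TIGER row by its
    # blkidfp00 and update it with the SF1 value.
    for blkidfp00, pop100 in sf1_blocks:
        conn.execute(
            "UPDATE tiger_blocks SET pop100 = ? WHERE blkidfp00 = ?",
            (pop100, blkidfp00),
        )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tiger_blocks (blkidfp00 TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO tiger_blocks VALUES (?)",
        [("240010001001000",), ("240010001001001",)],
    )
    # Only the first block has SF1 data in this toy example.
    denormalize(conn, [("240010001001000", 42)])
    return conn.execute(
        "SELECT blkidfp00, pop100 FROM tiger_blocks ORDER BY blkidfp00"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The optional NOT NULL step is omitted here because SQLite cannot add that constraint to an existing column; in Postgres it would be an `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` once every row has been populated.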

    Switch from GiST to GIN Indexes

    +

    + When the TIGER data is imported via shp2pgsql, a GiST + index is added to the geometry column by means of the + -I flag. This improves the performance of the + population calculations by a (wildly estimated) order of + magnitude. +

    + +

    + Postgres, however, offers another type of similar index — + the GIN + Index. If performance degrades beyond what is acceptable, it + may be worth evaluating the benefit of a GIN index versus the GiST + one. +

    +