<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Project Overview</title>

  <link rel="stylesheet" type="text/css" href="reset.css" />
  <link rel="stylesheet" type="text/css" href="common.css" />
</head>

<body>

<h1>Project Overview</h1>
<p id="project_title">
  Chlorine Transportation Optimization for the Chemical Security
  Analysis Center
</p>

<h2>General Goals</h2>
<p>
  The overall goal of the census-tools project is to provide the
  applied math lab (AML) with a set of libraries and utilities
  necessary to complete its assignment.
</p>

<p>
  One of the foremost goals that must be achieved is to model the
  average population density throughout the United States. Using
  this data, we would like to be able to calculate the risk
  associated with an <q>event</q> taking place somewhere in the
  United States. This will, in general, be an accident or other
  unexpected event that causes some damage to the surrounding
  population and environment.
</p>

<p>
  Our canonical example (for which the project was created) is that
  of a chlorine spill. If chlorine is spilled or otherwise released
  somewhere in the United States, we would like to be able to
  calculate the number of people affected.
</p>

<h2>Specific Goals</h2>
<ul>
  <li>
    Be able to determine the average population density at any point
    within the United States, given by GPS coordinates.
  </li>

  <li>
    Determine the total population within a subset of the U.S. This
    subset will be defined by a polygon or other two-dimensional
    (possibly piecewise-defined) function. (A sketch of such a query
    appears after this list.)
  </li>

  <li>
    Calculate the path that roads follow throughout the
    U.S. Ideally, the path would be calculated in terms of GPS
    coordinates within some <a
    href="http://en.wikipedia.org/wiki/SRID">spatial reference
    system</a>.
  </li>
</ul>
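
<p>
  As an illustration of the second goal, the following sketch sums
  the block populations that fall inside a polygon. The table and
  column names (<q>tiger_blocks</q>, <q>sf1_blocks</q>,
  <q>pop100</q>, <q>the_geom</q>) are hypothetical stand-ins for
  whatever the real schema defines, and the psycopg2 driver is
  assumed.
</p>

<pre>
import psycopg2

conn = psycopg2.connect(dbname="census")
cur = conn.cursor()

# A WKT polygon in NAD83 (SRID 4269); the coordinates are invented.
polygon = ("POLYGON((-76.7 39.2, -76.5 39.2, -76.5 39.4, "
           "-76.7 39.4, -76.7 39.2))")

# Sum the population of every block whose geometry intersects the
# polygon. Hypothetical table/column names throughout.
cur.execute("""
    SELECT SUM(sf1.pop100)
    FROM sf1_blocks sf1
    JOIN tiger_blocks tiger ON sf1.blkidfp00 = tiger.blkidfp00
    WHERE ST_Intersects(tiger.the_geom, ST_GeomFromText(%s, 4269))
""", (polygon,))

print(cur.fetchone()[0])
</pre>

<p>
  Note that this counts any block the polygon so much as touches; a
  more careful version would weight each block by the fraction of
  its area that falls inside the polygon.
</p>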

<h2>Software Requirements</h2>
<p>
  The following software is required to utilize census-tools in
  full. It should be possible to reuse the code independently of the
  other software, but many of the features such as database
  integration and build system automation require third-party
  programs.
</p>

<ol>
  <li>
    Most of the application code is written in <a
    href="http://www.python.org/">Python</a>, and so the Python
    runtime is required to run it.

    <ol>
      <li>
        We utilize a third-party library called <a
        href="http://pypi.python.org/pypi/Shapely">Shapely</a> for
        Python/GEOS integration. GEOS is required by PostGIS (see
        below), so it is not listed as a separate requirement, even
        though Shapely does depend on it. (A minimal Shapely sketch
        appears after this list.)
      </li>
    </ol>
  </li>

  <li>
    The build system utilizes <a
    href="http://www.gnu.org/software/make/">GNU Make</a>. The
    makefile simply automates some processes which can be run
    manually: executing tests, downloading data, importing
    shapefiles, etc. Other (non-GNU) versions of Make should also
    work.

    <ol>
      <li>
        The Make system assumes that the user has a POSIX-compatible
        shell. The author has tested the makefiles with <a
        href="http://en.wikipedia.org/wiki/Bash">Bash</a>, but other
        shells, or an environment such as <a
        href="http://en.wikipedia.org/wiki/Cygwin">Cygwin</a>, should
        suffice.
      </li>

      <li>
        <a href="http://en.wikipedia.org/wiki/Wget">wget</a> is used
        to download the Census shapefiles.
      </li>
    </ol>
  </li>

  <li>
    The underlying database management system is <a
    href="http://www.postgresql.org/">PostgreSQL</a>.
  </li>

  <li>
    The GIS capabilities are provided (to PostgreSQL) by <a
    href="http://postgis.refractions.net/">PostGIS</a>.
  </li>
</ol>
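
<p>
  For reference, a minimal sketch of the sort of geometry test that
  Shapely exposes to Python. The coordinates are invented.
</p>

<pre>
from shapely.geometry import Point, Polygon

# A toy block boundary and a candidate spill site.
block = Polygon([(-76.61, 39.28), (-76.60, 39.28),
                 (-76.60, 39.29), (-76.61, 39.29)])
spill = Point(-76.605, 39.285)

print(block.contains(spill))  # True
</pre>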

<h2>Census Databases</h2>
<p>
  There are two Census databases (obtained as flat files) which we
  require for our purposes. The first is <a
  href="http://www.census.gov/Press-Release/www/2001/sumfile1.html">Summary
  File 1</a>, which contains the block-level demographic
  information. The second database, <a
  href="http://www.census.gov/geo/www/tiger/">TIGER/Line</a>, gives
  us the geometry of those blocks.
</p>

<h3>Summary File 1</h3>

<p>
  There are a number of tables contained within the Summary File 1
  database. We require only one of these: the geographic header
  records. The Summary File 1 geographic header records provide,
  among other things, the following (a short density sketch follows
  the list):
</p>

<ul>
  <li>The GPS coordinates of each block's centroid.</li>
  <li>The total population contained within a block.</li>
  <li>The total land/water area of each block.</li>
  <li>
    Block/state/county/tract identifiers which combine to form a
    unique identifier.
  </li>
</ul>
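
<p>
  Average population density falls out of these fields directly:
  divide the total population by the land area. A sketch with
  invented numbers, assuming the area fields are given in square
  meters:
</p>

<pre>
# Average density of a single block from its header fields.
pop100   = 42         # total population of the block (invented)
arealand = 125000.0   # land area in square meters (invented)

density = pop100 / (arealand / 1.0e6)
print(density)        # persons per square kilometer
</pre>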


<h3>TIGER/Line</h3>

<p>
  The TIGER/Line information comes in the form of <a
  href="http://en.wikipedia.org/wiki/Shapefile">ESRI
  shapefiles</a>. These shapefiles contain the geometric information
  about the Census blocks.
</p>

<p>
  Each TIGER/Line record contains,
</p>

<ul>
  <li>
    The same block/state/county/tract information as the Summary
    File 1 geographic header records.
  </li>

  <li>
    A redundant field, called <q>blkidfp00</q>, which contains the
    concatenation of block/state/county/tract. This is our unique
    identifier (a sketch of its construction appears after this
    list).
  </li>

  <li>
    A geometry column containing a set of points, lines, and
    polygons that define the boundaries of the block.
  </li>
</ul>
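
<p>
  A sketch of how such an identifier is assembled. The component
  order and field widths used here (state 2, county 3, tract 6,
  block 4 digits) are our reading of the TIGER/Line documentation
  and should be treated as an assumption; the tract and block values
  are invented.
</p>

<pre>
def blkidfp00(state, county, tract, block):
    # Zero-pad each FIPS component to its assumed fixed width,
    # then concatenate.
    return (state.zfill(2) + county.zfill(3) +
            tract.zfill(6) + block.zfill(4))

# Baltimore County, Maryland, with an invented tract and block:
print(blkidfp00("24", "5", "40802", "1005"))  # "240050408021005"
</pre>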


<h3>Summary File 1 / TIGER Integration</h3>

<p>
  We need to correlate the TIGER/Line geometric information with the
  demographic information contained in the Summary File 1 geographic
  header records. To do this, we need to rely on the unique
  <q>blkidfp00</q> identifier.
</p>

<p>
  For the TIGER table, this field has already been calculated. We
  calculate it manually for the geographic header records. Then, we
  can define a one-to-one correspondence between geographic header
  records and TIGER/Line records. This satisfies one of our goals:
  correlating the two tables allows us to determine the population
  density at any point.
</p>
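
<p>
  A sketch of the query this correspondence enables: find the block
  containing a given point and report its density. The table and
  column names are hypothetical, as before.
</p>

<pre>
import psycopg2

conn = psycopg2.connect(dbname="census")
cur = conn.cursor()

# Density at one GPS coordinate: population divided by block area.
cur.execute("""
    SELECT sf1.pop100 / (sf1.arealand + sf1.areawatr)
    FROM sf1_blocks sf1
    JOIN tiger_blocks tiger ON sf1.blkidfp00 = tiger.blkidfp00
    WHERE ST_Contains(tiger.the_geom,
                      ST_SetSRID(ST_MakePoint(%s, %s), 4269))
""", (-76.61, 39.29))

row = cur.fetchone()
print(row[0] if row else "no block contains that point")
</pre>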


<h2>Importing the Data</h2>

<p>
  <em>
    Note: the <a href="../../makefile">makefile</a> provides a task
    for creation/import of the databases, but its use is not
    strictly required.
  </em>
</p>


<h3>Creation of the Database</h3>

<p>
  A Postgres/PostGIS database is required to store our Census
  data. The database name is unimportant (default: <q>census</q>),
  but several of the scripts refer to the table names. For
  simplicity, we will call the database <q>census</q> from now on.
</p>

<p>
  Once the database has been created, we need to load two PostGIS
  SQL scripts so that the database supports the GIS
  functionality. These two files are <q>lwpostgis.sql</q> (the
  PostGIS functions and types) and <q>spatial_ref_sys.sql</q> (the
  table of spatial reference systems). See the <a
  href="../../makefile">makefile</a> for an example of their import.
</p>
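
<p>
  A sketch of the creation steps that the makefile automates. The
  paths to the two PostGIS scripts vary by installation and are
  guesses here.
</p>

<pre>
import subprocess

# Installation-dependent paths; these are illustrative guesses.
postgis_sql = "/usr/share/postgresql/contrib/lwpostgis.sql"
srs_sql     = "/usr/share/postgresql/contrib/spatial_ref_sys.sql"

# Create the database, then load the PostGIS functions and the
# spatial reference systems into it.
subprocess.check_call(["createdb", "census"])
subprocess.check_call(["psql", "-d", "census", "-f", postgis_sql])
subprocess.check_call(["psql", "-d", "census", "-f", srs_sql])
</pre>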

<p>
  Before we import any data, we need to develop a database schema
  that models our Census data. The TIGER/Line schema is provided for
  us, so we have modeled the Summary File 1 schema in a similar
  fashion.
</p>

<p>
  The Summary File 1 table schema is <a
  href="../../sql/create-sf1_blocks-table.sql">defined in a standard
  SQL script</a>. The following steps will all require that a
  database exist and contain the table defined by this script.
</p>


<h3>Importing the Summary File 1 Block Records</h3>

<p>
  The Summary File 1 data are obtained as flat files. We therefore
  utilize a Python script, <a
  href="../../bin/sf1blocks2sql">sf1blocks2sql</a>, to parse and
  import the blocks. The source code to the script is documented,
  but basically, it performs three tasks: it parses each record from
  the geographic header record file, filters out any non-block
  records, and then writes out a SQL statement to insert that record
  into the Postgres/PostGIS database.
</p>
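
<p>
  A condensed sketch of those three tasks. The field offsets, the
  block-level summary code, and the file name are illustrative only;
  the real geographic header layout is documented in the script
  itself.
</p>

<pre>
import sys

def parse(line):
    # Illustrative fixed-width slices; not the real offsets.
    return {
        "sumlev":    line[8:11],    # summary level code
        "blkidfp00": line[11:26],   # concatenated identifier
        "pop100":    int(line[26:35]),
    }

for line in open("mdgeo.uf1"):
    rec = parse(line)
    if rec["sumlev"] != "101":   # keep only block-level records
        continue
    sys.stdout.write(
        "INSERT INTO sf1_blocks (blkidfp00, pop100) "
        "VALUES ('%s', %d);\n" % (rec["blkidfp00"], rec["pop100"]))
</pre>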


<h3>Importing the TIGER/Line Shapefiles</h3>

<p>
  Since the shapefiles are in a standard format, we can use
  pre-existing tools to import the data into our SQL
  database. PostGIS provides a binary, <q>shp2pgsql</q>, that will
  parse and convert the shapefiles to SQL.
</p>

<p>
  There is one caveat here: the <q>shp2pgsql</q> program requires
  an SRID as an argument; this SRID is assigned to each record it
  imports. We have designated an SRID of 4269, which denotes
  <q>NAD83</q>, or the North American Datum (1983). There may be
  some possibility of improvement here; since we are only
  considering the Mid-Atlantic, there may be coordinate systems
  (SRIDs) which are more accurate in our subset of the United
  States.
</p>
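
<p>
  A sketch of the import pipeline. The shapefile and table names are
  invented examples; <q>-s</q> assigns the SRID discussed above, and
  <q>-I</q> builds a spatial index.
</p>

<pre>
import subprocess

# Convert the shapefile to SQL and feed it straight to psql.
subprocess.check_call(
    "shp2pgsql -s 4269 -I tl_2008_24_tabblock00.shp tiger_blocks "
    "| psql -d census",
    shell=True)
</pre>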

<h2>Possible Optimizations</h2>
<p>
  There are a number of possible optimizations that can be made
  should performance ever become unacceptable. To date, these have
  been eschewed for lack of flexibility and/or development time.
</p>

<h3>De-normalization of TIGER/SF1 Block Data</h3>
<p>
  Currently, the TIGER/Line block data is stored in a separate table
  from the Summary File 1 block data. The two are combined at query
  time via <a href="http://en.wikipedia.org/wiki/Join_(SQL)">SQL
  JOIN</a>s. Since we import the TIGER data first, and use a custom
  import script for SF1, we could <a
  href="http://en.wikipedia.org/wiki/Denormalization">de-normalize</a>
  this design to increase query speed.
</p>

<p>
  This would slow down the SF1 import, of course; but the import
  only needs to be performed once. The procedure would look like the
  following (a SQL sketch appears after the list):
</p>

<ol>
  <li>
    Add the SF1 columns to the TIGER table, allowing them to be
    nullable initially (since they will all be NULL at first).
  </li>

  <li>
    Within the SF1 import, we would,
    <ol>
      <li>Parse a block</li>
      <li>
        Use that block's blkidfp00 to find the corresponding row in
        the TIGER table.
      </li>
      <li>
        Update the TIGER row with the values from SF1.
      </li>
    </ol>
  </li>

  <li>
    Optionally set the SF1 columns to NOT NULL. This may have
    <em>some</em> performance benefit, but I wouldn't count on it.
  </li>

  <li>
    Fix all the SQL queries to use the new schema.
  </li>
</ol>
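
<p>
  The sketch promised above, with hypothetical table and column
  names. The UPDATE would be issued once per block parsed by the SF1
  import.
</p>

<pre>
# Step 1: add the (initially nullable) SF1 columns. Run once.
ADD_COLUMNS = """
    ALTER TABLE tiger_blocks ADD COLUMN pop100 integer;
"""

# Step 2.3: copy one parsed block's values into its TIGER row.
UPDATE_ROW = """
    UPDATE tiger_blocks
    SET pop100 = %(pop100)s
    WHERE blkidfp00 = %(blkidfp00)s;
"""

# Step 3 (optional): forbid NULLs once the import has finished.
SET_NOT_NULL = """
    ALTER TABLE tiger_blocks ALTER COLUMN pop100 SET NOT NULL;
"""
</pre>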


<h3>Switch from GiST to GIN Indexes</h3>
<p>
  When the TIGER data is imported via <q>shp2pgsql</q>, a <a
  href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GiST</a>
  index is added to the geometry column by means of the
  <strong>-I</strong> flag. This improves the performance of the
  population calculations by a (wildly estimated) order of
  magnitude.
</p>
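
<p>
  For comparison, the index that <strong>-I</strong> creates can
  also be built by hand. A sketch, kept in a Python string to match
  the other examples; the table and column names are hypothetical.
</p>

<pre>
# The GiST index that shp2pgsql's -I flag creates, spelled out.
CREATE_GIST = """
    CREATE INDEX tiger_blocks_geom_idx
    ON tiger_blocks
    USING GIST (the_geom);
"""
</pre>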

<p>
  Postgres, however, offers another type of similar index &mdash;
  the <a
  href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GIN
  Index</a>. If performance degrades beyond what is acceptable, it
  may be worth evaluating the benefit of a GIN index versus the GiST
  one.
</p>

</body>
</html>