<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Project Overview</title>

  <link rel="stylesheet" type="text/css" href="reset.css" />
  <link rel="stylesheet" type="text/css" href="common.css" />
</head>

<body>
<h1>Project Overview</h1>
<p id="project_title">
  Chlorine Transportation Optimization for the Chemical Security
  Analysis Center
</p>

<h2>General Goals</h2>
<p>
  The overall goal of the census-tools project is to provide the
  applied math lab (AML) with a set of libraries and utilities
  necessary to complete its assignment.
</p>

<p>
  One of the foremost goals that must be achieved is to model the
  average population density throughout the United States. Using
  this data, we would like to be able to calculate the risk
  associated with an <em>event</em> taking place somewhere in the
  United States. This will, in general, be an accident or other
  unexpected event that causes some damage to the surrounding
  population and environment.
</p>

<p>
  Our canonical example (for which the project was created) is that
  of a chlorine spill. If chlorine is spilled or otherwise released
  somewhere in the United States, we would like to be able to
  calculate the number of people affected.
</p>
<h2>Specific Goals</h2>
<ul>
  <li>
    Determine the average population density at any point within
    the United States, given by GPS coordinates.
  </li>

  <li>
    Determine the total population within a subset of the U.S. This
    subset will be defined by a polygon or other two-dimensional
    (possibly piecewise-defined) region.
  </li>

  <li>
    Calculate the path that roads follow throughout the
    U.S. Ideally, the path would be calculated in terms of GPS
    coordinates within some <a
    href="http://en.wikipedia.org/wiki/SRID">spatial reference
    system</a>.
  </li>
</ul>
<h2>Software Requirements</h2>
<p>
  The following software is required to utilize census-tools in
  full. It should be possible to reuse the code independently of
  the other software, but many of the features, such as database
  integration and build system automation, require third-party
  programs.
</p>

<ol>
  <li>
    The build system utilizes <a
    href="http://www.gnu.org/software/make/">GNU Make</a>. The
    makefile simply automates some processes which can be run
    manually: executing tests, downloading data, importing
    shapefiles, etc. Other (non-GNU) versions of Make should also
    work.

    <ol>
      <li>
        The Make system assumes that the user has a
        POSIX-compatible shell. The author has tested the makefiles
        with <a href="http://en.wikipedia.org/wiki/Bash">Bash</a>,
        but other POSIX shells, such as those provided by <a
        href="http://en.wikipedia.org/wiki/Cygwin">Cygwin</a>,
        should suffice.
      </li>

      <li>
        <a href="http://en.wikipedia.org/wiki/Wget">wget</a> is
        used to download the Census shapefiles.
      </li>
    </ol>
  </li>
  <li>
    The underlying database management system is <a
    href="http://www.postgresql.org/">PostgreSQL</a>.
  </li>

  <li>
    The GIS capabilities are provided (to PostgreSQL) by <a
    href="http://postgis.refractions.net/">PostGIS</a>.
  </li>
</ol>
<h2>Census Databases</h2>
<p>
  There are two Census databases (obtained as flat files) which we
  require for our purposes. The first is <a
  href="http://www.census.gov/Press-Release/www/2001/sumfile1.html">Summary
  File 1</a>, which contains the block-level demographic
  information. The second database, <a
  href="http://www.census.gov/geo/www/tiger/">TIGER/Line</a>, gives
  us the geometry of those blocks.
</p>
<h3>Summary File 1</h3>

<p>
  There are a number of tables contained within the Summary File 1
  database. We require only one of these: the geographic header
  records. The Summary File 1 geographic header records provide,
  among other things, the following:
</p>

<ul>
  <li>The GPS coordinates of each block's centroid.</li>
  <li>The total population contained within a block.</li>
  <li>The total land/water area of each block.</li>
  <li>
    Block/state/county/tract identifiers which combine to form a
    unique identifier.
  </li>
</ul>
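
<p>
  As a quick illustration of how these fields get used, the sketch
  below computes a per-block population density once the records
  have been loaded into the <em>sf1_blocks</em> table described
  under <q>Importing the Data</q> below. The column names here are
  assumptions made for illustration; the actual schema script in
  the repository is authoritative.
</p>

<pre>
-- Hypothetical sketch: people per unit of land area, per block.
-- Column names (pop100, arealand) are assumed, not authoritative.
SELECT blkidfp00,
       pop100::float8 / NULLIF(arealand, 0) AS density
FROM sf1_blocks;
</pre>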


<h3>TIGER/Line</h3>

<p>
  The TIGER/Line information comes in the form of <a
  href="http://en.wikipedia.org/wiki/Shapefile">ESRI
  shapefiles</a>. These shapefiles contain the geometric
  information about the Census blocks.
</p>

<p>
  Each TIGER/Line record contains:
</p>

<ul>
  <li>
    The same block/state/county/tract information as the Summary
    File 1 geographic header records.
  </li>

  <li>
    A redundant field, called <em>blkidfp00</em>, which contains
    the concatenation of block/state/county/tract. This is our
    unique identifier.
  </li>

  <li>
    A geometry column containing a set of points, lines, and
    polygons that define the boundaries of the block.
  </li>
</ul>
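
<p>
  For example, once the shapefiles have been imported (see
  <q>Importing the TIGER/Line Shapefiles</q> below), the geometry
  column can be inspected with standard PostGIS functions. The
  table name <em>tiger_blocks</em> and geometry column name
  <em>the_geom</em> are illustrative assumptions; shp2pgsql chooses
  the actual names at import time.
</p>

<pre>
-- Hypothetical sketch: show one block's identifier and its
-- boundary geometry in well-known-text form.
SELECT blkidfp00, ST_AsText(the_geom)
FROM tiger_blocks
LIMIT 1;
</pre>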


<h3>Summary File 1 / TIGER Integration</h3>

<p>
  We need to correlate the TIGER/Line geometric information with
  the demographic information contained in the Summary File 1
  geographic header records. To do this, we rely on the unique
  <em>blkidfp00</em> identifier.
</p>

<p>
  For the TIGER table, this field has already been calculated. We
  calculate it manually for the geographic header records. Then we
  can define a one-to-one correspondence between geographic header
  records and TIGER/Line records. This satisfies one of our goals:
  correlating the two tables allows us to determine the population
  density at any point.
</p>
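
<p>
  In SQL, that correspondence is just a join on the shared
  identifier. A minimal sketch, using the illustrative table and
  column names from above:
</p>

<pre>
-- Hypothetical sketch: population joined to geometry via the
-- shared blkidfp00 identifier.
SELECT sf1.pop100, tiger.the_geom
FROM sf1_blocks sf1
JOIN tiger_blocks tiger
  ON tiger.blkidfp00 = sf1.blkidfp00;
</pre>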


<h2>Importing the Data</h2>

<p>
  <em>
    Note: the <a href="../../makefile">makefile</a> provides a task
    for creation/import of the databases, but its use is not
    strictly required.
  </em>
</p>

<h3>Creation of the Database</h3>

<p>
  A Postgres/PostGIS database is required to store our Census
  data. The database name is unimportant (default: <em>census</em>),
  but several of the scripts refer to the table names. For
  simplicity, we will call the database <em>census</em> from now
  on.
</p>

<p>
  Once the database has been created, we need to load two PostGIS
  SQL scripts so that we can support the GIS functionality. These
  two scripts are <em>lwpostgis.sql</em> and
  <em>spatial_ref_sys.sql</em>. See the <a
  href="../../makefile">makefile</a> for an example of their
  import.
</p>
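
<p>
  A sketch of that setup, as it might look in a <em>psql</em>
  session, follows. The script paths are assumptions; they vary by
  PostGIS installation, and the makefile contains the actual
  commands.
</p>

<pre>
-- Hypothetical sketch of the database setup; script paths are
-- installation-dependent placeholders.
CREATE DATABASE census;
\c census
\i /usr/share/postgresql/contrib/lwpostgis.sql
\i /usr/share/postgresql/contrib/spatial_ref_sys.sql
</pre>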

<p>
  Before we import any data, we need to develop a database schema
  that models our Census data. The TIGER/Line schema is provided
  for us, so we have modeled the Summary File 1 schema in a similar
  fashion.
</p>

<p>
  The Summary File 1 table schema is <a
  href="../../sql/create-sf1_blocks-table.sql">defined in a
  standard SQL script</a>. All of the following steps require that
  the database exist and contain the table defined by this script.
</p>
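
<p>
  The script linked above is authoritative; the abbreviated sketch
  below only suggests the general shape of such a table, and every
  column name and type in it is an assumption.
</p>

<pre>
-- Hypothetical sketch of the sf1_blocks table; see
-- sql/create-sf1_blocks-table.sql for the real definition.
CREATE TABLE sf1_blocks (
  blkidfp00 varchar(15) PRIMARY KEY, -- state||county||tract||block
  state     varchar(2)  NOT NULL,
  county    varchar(3)  NOT NULL,
  tract     varchar(6)  NOT NULL,
  block     varchar(4)  NOT NULL,
  pop100    integer     NOT NULL,    -- total population
  arealand  double precision,        -- land area
  areawater double precision,        -- water area
  intptlat  double precision,        -- centroid latitude
  intptlon  double precision         -- centroid longitude
);
</pre>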


<h3>Importing the Summary File 1 Block Records</h3>

<p>
  The Summary File 1 data are obtained as flat files. We therefore
  utilize a Python script, <a
  href="../../bin/sf1blocks2sql">sf1blocks2sql</a>, to parse and
  import the blocks. The source code of the script is documented,
  but basically, it performs three tasks: it parses each record
  from the geographic header record file, filters out any non-block
  records, and then writes out a SQL statement to insert that
  record into the Postgres/PostGIS database.
</p>
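
<p>
  The generated SQL is a stream of plain INSERT statements, one per
  block record. A hypothetical example with made-up values follows;
  the real script determines the column list.
</p>

<pre>
-- Hypothetical sketch of one statement the script might emit.
INSERT INTO sf1_blocks
  (blkidfp00, state, county, tract, block, pop100, arealand)
VALUES
  ('240054011001000', '24', '005', '401100', '1000', 42, 12345.0);
</pre>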


<h3>Importing the TIGER/Line Shapefiles</h3>

<p>
  Since the shapefiles are in a standard format, we can use
  pre-existing tools to import the data into our SQL
  database. PostGIS provides a binary, <em>shp2pgsql</em>, that
  will parse and convert the shapefiles to SQL.
</p>

<p>
  There is one caveat here: the <em>shp2pgsql</em> program requires
  an SRID as an argument; this SRID is assigned to each record it
  imports. We have designated an SRID of 4269, which denotes
  <q>NAD83</q>, the North American Datum (1983). There may be some
  room for improvement here: since we are only considering the
  Mid-Atlantic, there may be coordinate systems (SRIDs) which are
  more accurate within our subset of the United States.
</p>
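
<p>
  Candidate SRIDs can be examined in the standard
  <em>spatial_ref_sys</em> table loaded during database creation;
  for example, the NAD83 entry:
</p>

<pre>
-- Look up SRID 4269 (NAD83) in the spatial_ref_sys table.
SELECT srid, srtext
FROM spatial_ref_sys
WHERE srid = 4269;
</pre>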

<h2>Possible Optimizations</h2>
<p>
  There are a number of possible optimizations that can be made
  should performance ever become prohibitive. To date, these have
  been eschewed for lack of flexibility and/or development time.
</p>

<h3>De-normalization of TIGER/SF1 Block Data</h3>
<p>
  Currently, the TIGER/Line block data is stored in a separate
  table from the Summary File 1 block data. The two are combined at
  query time via <a
  href="http://en.wikipedia.org/wiki/Join_(SQL)">SQL
  JOIN</a>s. Since we import the TIGER data first, and use a custom
  import script for SF1, we could <a
  href="http://en.wikipedia.org/wiki/Denormalization">de-normalize</a>
  this design to increase query speed.
</p>

<p>
  This would slow down the SF1 import, of course; but the import
  only needs to be performed once. The procedure would look like
  the following (a SQL sketch appears after the list):
</p>

<ol>
  <li>
    Add the SF1 columns to the TIGER table, allowing them to be
    nullable initially (since they will all be NULL at first).
  </li>

  <li>
    Within the SF1 import, we would:
    <ol>
      <li>Parse a block.</li>
      <li>
        Use that block's blkidfp00 to find the corresponding row
        in the TIGER table.
      </li>
      <li>
        Update the TIGER row with the values from SF1.
      </li>
    </ol>
  </li>

  <li>
    Optionally set the SF1 columns to NOT NULL. This may have
    <em>some</em> performance benefit, but I wouldn't count on it.
  </li>

  <li>
    Fix all the SQL queries to use the new schema.
  </li>
</ol>
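
<p>
  A rough SQL sketch of the first two steps, again using the
  illustrative table and column names from earlier. It is written
  as one bulk UPDATE here; the import script would instead issue
  one such update per parsed block.
</p>

<pre>
-- Hypothetical sketch: add an SF1 column to the TIGER table
-- (nullable at first), then fill it in via blkidfp00.
ALTER TABLE tiger_blocks ADD COLUMN pop100 integer;

UPDATE tiger_blocks tiger
SET pop100 = sf1.pop100
FROM sf1_blocks sf1
WHERE sf1.blkidfp00 = tiger.blkidfp00;
</pre>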


<h3>Switch from GiST to GIN Indexes</h3>
<p>
  When the TIGER data is imported via <em>shp2pgsql</em>, a <a
  href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GiST</a>
  index is added to the geometry column by means of the
  <strong>-I</strong> flag. This improves the performance of the
  population calculations by a (wildly-estimated) order of
  magnitude.
</p>

<p>
  Postgres, however, offers another, similar type of index: the <a
  href="http://www.postgresql.org/docs/8.4/static/textsearch-indexes.html">GIN
  index</a>. If performance degrades beyond what is acceptable, it
  may be worth evaluating the benefit of a GIN index versus the
  GiST one.
</p>
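
<p>
  For reference, the index created by the <strong>-I</strong> flag
  is equivalent to something like the following; the index and
  table names are illustrative.
</p>

<pre>
-- GiST index on the geometry column, as shp2pgsql -I would
-- create it (names assumed for illustration).
CREATE INDEX tiger_blocks_the_geom_idx
  ON tiger_blocks
  USING GIST (the_geom);
</pre>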

</body>
</html>