X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=f1edf446f5b20b5a68dddf9722aec2006869c1dd;hb=5660ce1af1e19f004aaecb024ef6f9fb95b9e5c7;hp=66506c6febaf300dd8be82ce7e5f9c1b624627fb;hpb=65b7b8eccef710e67fa6050dd4bdaffbbea708a7;p=dead%2Fhtsn-import.git diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1 index 66506c6..f1edf44 100644 --- a/doc/man1/htsn-import.1 +++ b/doc/man1/htsn-import.1 @@ -16,9 +16,10 @@ XML documents contained therein. But what to do with them? .P The purpose of \fBhtsn-import\fR is to take these XML documents and get them into something we can use, a relational database management -system (RDBMS), loosely known as a SQL database. The structure of +system (RDBMS), otherwise known as a SQL database. The structure of relational database, is, well, relational, and the feed XML is not. So -there is some work to do before the data can be inserted. +there is some work to do before the data can be imported into the +database. .P First, we must parse the XML. Each supported document type (see below) has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a @@ -63,33 +64,46 @@ weatherxml.dtd .P At the top level, we have one table for each of the XML document types that we import. For example, the documents corresponding to -\fInewsxml.dtd\fR will have a table called \(dqnews\(dq. +\fInewsxml.dtd\fR will have a table called \(dqnews\(dq. All top-level +tables contain two important fields, \(dqxml_file_id\(dq and +\(dqtime_stamp\(dq. The former is unique and prevents us from +inserting the same data twice. The time stamp on the other hand lets +us know when the data is old and can be removed. The database schema +make it possible to delete only the outdated top-level records; all +transient children should be removed by triggers. .P These top-level tables will often have children. For example, each news item has zero or more locations associated with it. The child table will be named _, which in this case -corresponsds to \(dqnews_locations\(dq. -.P -To relate the two, a third table exists with name __. Note the two underscores. This prevents -ambiguity when the child table itself contains underscores. As long we -never go more than one level down, this system should suffice. The -table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called -\(dqnews__news_locations\(dq. -.P -Wherever possible, children are kept unique to prevent pointless -duplication. This slows down inserts, and speeds up reads (which we -assume are much more frequent). The current rate at which the feed -transmits XML is much too slow to cause problems inserting. +corresponds to \(dqnews_locations\(dq. +.P +To relate the two, a third table may exist with name +__. Note the two underscores. This prevents ambiguity +when the child table itself contains underscores. The table joining +\(dqnews\(dq with \(dqnews_locations\(dq is thus called +\(dqnews__news_locations\(dq. This is necessary when the child table +has a unique constraint; we don't want to blindly insert duplicate +records keyed to the parent. Instead we'd like to use the third table +to map an existing child to the new parent. +.P +Where it makes sense, children are kept unique to prevent pointless +duplication. This slows down inserts, and speeds up reads (which are +much more frequent). There is a tradeoff to be made, however. For a +table with a small, fixed upper bound on the number of rows (like +\(dqodds_casinos\(dq), there is great benefit to de-duplication. The +total number of rows stays small, so inserts are still quick, and many +duplicate rows are eliminated. +.P +But, with a table like \(dqodds_games\(dq, the number of games grows +quickly and without bound. It is therefore more beneficial to be able +to delete the old games (through an ON DELETE CASCADE, tied to +\(dqodds\(dq) than it is to eliminate duplication. A table like +\(dqnews_locations\(dq is somewhere in-between. It is hoped that the +unique constraint in the top-level table's \(dqxml_file_id\(dq will +prevent duplication in this case anyway. .P UML diagrams of the resulting database schema for each XML document type are provided with the \fBhtsn-import\fR documentation. -.P -In some cases the top-level table for a document type has been -omitted. For example, all of the information in the the -\(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq -elements. We therefore omit the \(dqinjuries\(dq table and create only -\(dqinjuries_listings\(dq. .SH OPTIONS @@ -101,7 +115,7 @@ Default: Sqlite .IP \fB\-\-connection-string\fR,\ \fB\-c\fR The connection string used for connecting to the database backend -given by the \fB\-\-baclend\fR option. The default is appropriate for +given by the \fB\-\-backend\fR option. The default is appropriate for the \fISqlite\fR backend. Default: \(dq:memory:\(dq @@ -135,12 +149,6 @@ not work. Default: disabled -.IP \fB\-\-username\fR,\ \fB\-u\fR -Your TSN username. A username is required, so you must supply one -either on the command line or in a configuration file. - -Default: none - .SH CONFIGURATION FILE .P Any of the command-line options mentioned above can be specified in a