From: Michael Orlitzky
Date: Tue, 21 Jan 2014 19:00:05 +0000 (-0500)
Subject: Update the database schema docs.
X-Git-Tag: 0.0.1~10
X-Git-Url: https://gitweb.michael.orlitzky.com/?a=commitdiff_plain;h=a599e73b762cc14239c2dc22be9bec7c1df90548;p=dead%2Fhtsn-import.git

Update the database schema docs.
---

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index b0b4f9c..f1edf44 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -16,9 +16,10 @@ XML documents contained therein. But what to do with them?
 .P
 The purpose of \fBhtsn-import\fR is to take these XML documents and
 get them into something we can use, a relational database management
-system (RDBMS), loosely known as a SQL database. The structure of
+system (RDBMS), otherwise known as a SQL database. The structure of
 relational database, is, well, relational, and the feed XML is not. So
-there is some work to do before the data can be inserted.
+there is some work to do before the data can be imported into the
+database.
 .P
 First, we must parse the XML. Each supported document type (see below)
 has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
@@ -63,18 +64,27 @@ weatherxml.dtd
 .P
 At the top level, we have one table for each of the XML document types
 that we import. For example, the documents corresponding to
-\fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
+\fInewsxml.dtd\fR will have a table called \(dqnews\(dq. All top-level
+tables contain two important fields, \(dqxml_file_id\(dq and
+\(dqtime_stamp\(dq. The former is unique and prevents us from
+inserting the same data twice. The time stamp on the other hand lets
+us know when the data is old and can be removed. The database schema
+make it possible to delete only the outdated top-level records; all
+transient children should be removed by triggers.
 .P
 These top-level tables will often have children. For example, each
 news item has zero or more locations associated with it.
 The child table will be named <parent>_<children>, which in this case
 corresponds to \(dqnews_locations\(dq.
 .P
-To relate the two, a third table may exist with name
-<parent>__<children>. Note the two underscores. This prevents
-ambiguity when the child table itself contains underscores. The table
-joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
-\(dqnews__news_locations\(dq.
+To relate the two, a third table may exist with name
+<parent>__<children>. Note the two underscores. This prevents ambiguity
+when the child table itself contains underscores. The table joining
+\(dqnews\(dq with \(dqnews_locations\(dq is thus called
+\(dqnews__news_locations\(dq. This is necessary when the child table
+has a unique constraint; we don't want to blindly insert duplicate
+records keyed to the parent. Instead we'd like to use the third table
+to map an existing child to the new parent.
 .P
 Where it makes sense, children are kept unique to prevent pointless
 duplication. This slows down inserts, and speeds up reads (which are
@@ -86,18 +96,14 @@ duplicate rows are eliminated.
 .P
 But, with a table like \(dqodds_games\(dq, the number of games grows
 quickly and without bound. It is therefore more beneficial to be able
-to delete the old games (though an ON DELETE CASCADE, tied to
+to delete the old games (through an ON DELETE CASCADE, tied to
 \(dqodds\(dq) than it is to eliminate duplication. A table like
-\(dqnews_locations\(dq is somewhere in-between.
+\(dqnews_locations\(dq is somewhere in-between. It is hoped that the
+unique constraint in the top-level table's \(dqxml_file_id\(dq will
+prevent duplication in this case anyway.
 .P
 UML diagrams of the resulting database schema for each XML document
 type are provided with the \fBhtsn-import\fR documentation.
-.P
-In some cases the top-level table for a document type has been
-omitted. For example, all of the information in the the
-\(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq
-elements. We therefore omit the \(dqinjuries\(dq table and create only
-\(dqinjuries_listings\(dq.
 .SH OPTIONS