.P
The purpose of \fBhtsn-import\fR is to take these XML documents and
get them into something we can use, a relational database management
-system (RDBMS), loosely known as a SQL database. The structure of
+system (RDBMS), otherwise known as a SQL database. The structure of
relational database, is, well, relational, and the feed XML is not. So
-there is some work to do before the data can be inserted.
+there is some work to do before the data can be imported into the
+database.
.P
First, we must parse the XML. Each supported document type (see below)
has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
.P
At the top level, we have one table for each of the XML document types
that we import. For example, the documents corresponding to
-\fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
+\fInewsxml.dtd\fR will have a table called \(dqnews\(dq. All top-level
+tables contain two important fields, \(dqxml_file_id\(dq and
+\(dqtime_stamp\(dq. The former is unique and prevents us from
+inserting the same data twice. The time stamp on the other hand lets
+us know when the data is old and can be removed. The database schema
+make it possible to delete only the outdated top-level records; all
+transient children should be removed by triggers.
.P
These top-level tables will often have children. For example, each
news item has zero or more locations associated with it. The child
table will be named <parent>_<children>, which in this case
corresponds to \(dqnews_locations\(dq.
.P
-To relate the two, a third table may exist with name <parent
-table>__<child table>. Note the two underscores. This prevents
-ambiguity when the child table itself contains underscores. The table
-joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
-\(dqnews__news_locations\(dq.
+To relate the two, a third table may exist with name
+<parent>__<child>. Note the two underscores. This prevents ambiguity
+when the child table itself contains underscores. The table joining
+\(dqnews\(dq with \(dqnews_locations\(dq is thus called
+\(dqnews__news_locations\(dq. This is necessary when the child table
+has a unique constraint; we don't want to blindly insert duplicate
+records keyed to the parent. Instead we'd like to use the third table
+to map an existing child to the new parent.
.P
Where it makes sense, children are kept unique to prevent pointless
duplication. This slows down inserts, and speeds up reads (which are
.P
But, with a table like \(dqodds_games\(dq, the number of games grows
quickly and without bound. It is therefore more beneficial to be able
-to delete the old games (though an ON DELETE CASCADE, tied to
+to delete the old games (through an ON DELETE CASCADE, tied to
\(dqodds\(dq) than it is to eliminate duplication. A table like
-\(dqnews_locations\(dq is somewhere in-between.
+\(dqnews_locations\(dq is somewhere in-between. It is hoped that the
+unique constraint in the top-level table's \(dqxml_file_id\(dq will
+prevent duplication in this case anyway.
.P
UML diagrams of the resulting database schema for each XML document
type are provided with the \fBhtsn-import\fR documentation.
-.P
-In some cases the top-level table for a document type has been
-omitted. For example, all of the information in the the
-\(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq
-elements. We therefore omit the \(dqinjuries\(dq table and create only
-\(dqinjuries_listings\(dq.
.SH OPTIONS