+Most of the production schemas are generated this way; however, a few
+needed manual tweaking. The final, believed-to-be-correct schemas for
+all supported document types can be found in the \(dqschema\(dq folder in
+the project root. Having the correct DTDs available means you
+don't need XML-Schema-learner available to install \fBhtsn-import\fR.
+
+.SH XML SCHEMA UPDATES
+.P
+If a new tag is added to an XML document type, \fBhtsn-import\fR will
+most likely refuse to parse it, since the new documents no longer
+match the existing DTD.
+.P
+The first thing to do in that case is add the unparseable document to
+the \(dqschemagen\(dq directory, and generate a new DTD that matches
+both the old and new samples. Once a new, correct DTD has been
+generated, it should be added to the \(dqschema\(dq directory. Then,
+the parser can be updated and \fBhtsn-import\fR rebuilt.
+.P
+At this point, \fBhtsn-import\fR should be capable of importing the
+new document. But the addition of the new tag will most require new
+fields in the database. Fortunately, easy migrations like this are
+handled automatically. As an example, at one point, \fIOdds_XML.dtd\fR
+did not contain the \(dqHStarter\(dq and \(dqAStarter\(dq elements
+associated with its games. Suppose we parse one of the old documents
+(without \(dqHStarter\(dq and \(dqAStarter\(dq) using an old version
+of \fBhtsn-import\fR:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+Migration: CREATE TABLE \(dqodds\(dq ...
+Successfully imported schemagen/Odds_XML/19996433.xml.
+Processed 1 document(s) total.
+.fi
+.P
+At this point, the database schema matches the old documents, that is,
+the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+version of \fBhtsn-import\fR, supporting the new fields, the migration
+is handled gracefully:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/21315768.xml"
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_name\(dq VARCHAR;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_name\(dq VARCHAR;
+Successfully imported schemagen/Odds_XML/21315768.xml.
+Processed 1 document(s) total.
+.fi
+.P
+If fields are removed from the schema, then manual intervention may be
+necessary:
+.P
+.nf
+.I $ htsn-import -b Postgres -c 'dbname=htsn user=postgres' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+ERROR: Database migration: manual intervention required.
+The following actions are considered unsafe:
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_name\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_name\(dq
+
+ERROR: Failed to import file schemagen/Odds_XML/19996433.xml.
+Processed 0 document(s) total.
+.fi
+.P
+To fix these errors, manually invoke the SQL commands that were
+considered unsafe:
+.P
+.nf
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_name;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_name;'"
+ALTER TABLE
+.fi
+.P
+After manually adjusting the schema, the import should succeed.
+
+.SH XML SCHEMA ODDITIES
+.P
+There are a number of problems with the XML on the wire. Even if we
+construct the DTDs ourselves, the results are sometimes
+inconsistent. Here we document a few of them.
+
+.IP \[bu] 2
+\fInewsxml.dtd\fR
+
+The TSN DTD for news (and almost all XML on the wire) suggests that
+there is a exactly one (possibly-empty) <SMS> element present in each
+message. However, we have seen an example (XML_File_ID 21232353) where
+an empty <SMS> followed a non-empty one:
+
+.fi
+<SMS>Odd Man Rush: Snow under pressure to improve Isles quickly</SMS>
+<SMS></SMS>
+.nf
+
+We don't parse this case at the moment, but we do recognize it and report
+it as unsupported so that offending documents can be removed. An example
+is provided as test/xml/newsxml-multiple-sms.xml.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+Unlike earlylineXML.dtd, this document type has more than one <game>
+associated with each <date>. Moreover, each <date> has a bunch of
+<note> children that are supposed to be associated with the <game>s,
+but the document structure indicates no explicit relationship. For
+example,
+
+.nf
+<date>
+ <note>...</note>
+ <game>...</game>
+ <game>...</game>
+ <note>...</note>
+ <game>...</game>
+</date>
+.fi
+
+Here the first <note> is inferred to apply to the two <game>s that
+follow it, and the second <note> applies to the single <game> that
+follows it. But this is very fragile to parse. Instead, we use a hack
+to facilitate (un)pickling, and then drop the notes entirely during
+the database conversion.
+
+A similar workaround is implemented for Odds_XML.dtd.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
+
+The <Notes> elements here are supposed to be associated with a set of
+<Game> elements, but since the pair
+(<Notes>...</Notes><Game>...</Game>) can appear zero or more times,
+this leads to ambiguity in parsing. We therefore ignore the notes
+entirely (although a hack is employed to facilitate parsing). The same
+thing goes for the newer <League_Name> element.
+
+.IP \[bu]
+\fIweatherxml.dtd\fR
+
+There appear to be two types of weather documents; the first has
+<listing> contained within <forecast> and the second has <forecast>
+contained within <listing>. While it would be possible to parse both,
+it would greatly complicate things. The first form is more common, so
+that's all we support for now. An example is provided as
+test/xml/weatherxml-type2.xml.
+
+We are however able to identify the second type. When one is
+encountered, an informational message (that it is unsupported) will be
+printed. If the \fI\-\-remove\fR flag is used, the file will be
+deleted. This prevents documents that we know we can't import from
+building up.
+
+Another problem that comes up occasionally is that the home and away
+team elements appear in the reverse order. As in the other case, we
+report these as unsupported and then \(dqsucceed\(dq so that the
+offending document can be removed if desired. An example is provided
+as test/xml/weatherxml-backwards-teams.xml.
+
+.SH DATE/TIME ISSUES
+
+Dates and times appear in a number of places on the feed. The date
+portions are usually, fine, but the times often lack important
+information such as the time zone, or whether \(dq8 o'clock\(dq means
+a.m. or p.m.
+
+The most pervasive issue occurs with the timestamps that are included
+in every message. A typical timestamp looks like,
+
+.nf
+<time_stamp> May 24, 2014, at 04:18 PM ET </time_stamp>
+.fi
+
+The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
+\(dqET\(dq is not a valid time zone. It stands for \(dqEastern
+Time\(dq, which can belong to either of two time zones, EST or EDT,
+based on the time of the year (that is, whether or not daylight
+savings time is in effect) and one's location (for example, Arizona
+doesn't observe daylight savings time). It's not much more useful to
+be off by one hour than it is to be off by five hours, and since we
+can't determine the true offset from the timestamp, we always parse
+and store these as UTC.
+
+Here's a list of the ones that may cause surprises:
+
+.IP \[bu] 2
+\fIAutoRacingResultsXML.dtd\fR
+
+The <RaceDate> elements contain a full date and time, but no time zone
+information:
+
+.nf
+<RaceDate>5/24/2014 2:45:00 PM</RaceDate>
+.fi
+
+We parse them as UTC, which will be wrong when stored,
+but \(dqcorrect\(dq if the new UTC time zone is ignored.
+
+.IP \[bu]
+\fIAuto_Racing_Schedule_XML.dtd\fR
+
+The <Race_Date> and <Race_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Race_Date>02/16/2013</Race_Date>
+<Race_Time>08:10 PM</Race_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The race times are
+not always present in the database, but when they are missing, they
+are presented as \(dqTBA\(dq (to be announced):
+
+.nf
+<Race_Time>TBA</Race_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the race
+date in that case.
+
+.IP \[bu]
+\fIearlylineXML.dtd\fR
+
+The <time> elements in the early lines contain neither a time zone nor
+an am/pm identifier:
+
+.nf
+<time>8:30</time>
+.fi
+
+The times are parsed and stored as UTC, since we
+don't have any other information upon which to base a guess. Even if
+one ignores the UTC time zone, the time can possibly be off by 12
+hours (due to the a.m./p.m. issue).
+
+The game <time> elements can also be empty. Since we store the
+combined game date/time in one field, these games will appear to begin
+at midnight on the day they occur.
+
+.IP \[bu]
+\fIjfilexml.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>06/15/2014</Game_Date>
+<Game_Time>08:00 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC.
+
+The <CurrentTimestamp> elements suffer a similar problem, sans the
+date:
+
+.nf
+<CurrentTimeStamp>11:30 A.M.</CurrentTimeStamp>
+.fi
+
+They are also stored as UTC.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+See earlylineXML.dtd.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>01/04/2014</Game_Date>
+<Game_Time>04:35 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC.
+
+.IP \[bu]
+\fISchedule_Changes_XML.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>06/06/2014</Game_Date>
+<Game_Time>04:00 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The game times are
+not always present in the database, but when they are missing, they
+are presented as \(dqTBA\(dq (to be announced):
+
+.nf
+<Game_Time>TBA</Game_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the game
+date in that case.
+
+.SH DEPLOYMENT
+.P
+When deploying for the first time, the target database will most
+likely be empty. The schema will be migrated when a new document type
+is seen, but this has a downside: it can be months before every
+supported document type has been seen once. This can make it difficult
+to test the database permissions.
+.P
+Since all of the test XML documents have old timestamps, one easy
+workaround is the following: simply import all of the test XML
+documents, and then delete them using whatever script is used to prune
+old entries. This will force the migration of the schema, after which
+you can set and test the database permissions.
+.P
+Something as simple as,
+.P
+.nf
+.I $ find ./test/xml -iname '*.xml' | xargs htsn-import -c foo.sqlite
+.fi
+.P
+should do it.