X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=cd922296fb272bfb7550571f46c858e593c61b2c;hb=a512aa4174a75fb48821a90909d3c61b6b369493;hp=6c64304bbac6db71737e1d7f6a949f09a348a417;hpb=160b2657d9dc02bfac55abe4c0e2fcba61de3da8;p=dead%2Fhtsn-import.git

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 6c64304..cd92229 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -23,9 +23,10 @@ database.
 .P
 First, we must parse the XML. Each supported document type (see below)
 has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
-synonym for serialize here). That means that we parse the entire
-document into a data structure, and if we pickle (serialize) that data
-structure, we get the exact same XML document tha we started with.
+synonym for \(dqserialize\(dq here). That means that we parse the
+entire document into a data structure, and if we pickle (serialize)
+that data structure, we get the exact same XML document that we started
+with.
 .P
 This is important for two reasons. First, it serves as a second level
 of validation. The first validation is performed by the XML parser,
@@ -42,189 +43,8 @@ exist. We don't want the schema to change out from under us without
 warning, so it's important that no XML be parsed that would result in
 a different schema than we had previously. Since we can
 pickle/unpickle everything already, this should be impossible.
-
-.SH SUPPORTED DOCUMENT TYPES
 .P
-The XML document types obtained from the feed are uniquely identified
-by their DTDs. We currently support documents with the following DTDs:
-.IP \[bu] 2
-AutoRacingResultsXML.dtd
-.IP \[bu]
-Auto_Racing_Schedule_XML.dtd
-.IP \[bu]
-Heartbeat.dtd
-.IP \[bu]
-Injuries_Detail_XML.dtd
-.IP \[bu]
-injuriesxml.dtd
-.IP \[bu]
-newsxml.dtd
-.IP \[bu]
-Odds_XML.dtd
-.IP \[bu]
-scoresxml.dtd
-.IP \[bu]
-weatherxml.dtd
-.IP \[bu]
-GameInfo
-.RS
-.IP \[bu]
-CBASK_Lineup_XML.dtd
-.IP \[bu]
-cbaskpreviewxml.dtd
-.IP \[bu]
-cflpreviewxml.dtd
-.IP \[bu]
-Matchup_NBA_NHL_XML.dtd
-.IP \[bu]
-MLB_Gaming_Matchup_XML.dtd
-.IP \[bu]
-MLB_Lineup_XML.dtd
-.IP \[bu]
-MLB_Matchup_XML.dtd
-.IP \[bu]
-MLS_Preview_XML.dtd
-.IP \[bu]
-mlbpreviewxml.dtd
-.IP \[bu]
-NBA_Gaming_Matchup_XML.dtd
-.IP \[bu]
-NBA_Playoff_Matchup_XML.dtd
-.IP \[bu]
-NBALineupXML.dtd
-.IP \[bu]
-nbapreviewxml.dtd
-.IP \[bu]
-NCAA_FB_Preview_XML.dtd
-.IP \[bu]
-NFL_NCAA_FB_Matchup_XML.dtd
-.IP \[bu]
-nflpreviewxml.dtd
-.IP \[bu]
-nhlpreviewxml.dtd
-.IP \[bu]
-recapxml.dtd
-.IP \[bu]
-WorldBaseballPreviewXML.dtd
-.RE
-.IP \[bu]
-SportInfo
-.RS
-.IP \[bu]
-CBASK_3PPctXML.dtd
-.IP \[bu]
-Cbask_All_Tourn_Teams_XML.dtd
-.IP \[bu]
-CBASK_AssistsXML.dtd
-.IP \[bu]
-Cbask_Awards_XML.dtd
-.IP \[bu]
-CBASK_BlocksXML.dtd
-.IP \[bu]
-Cbask_Conf_Standings_XML.dtd
-.IP \[bu]
-Cbask_DivII_III_Indv_Stats_XML.dtd
-.IP \[bu]
-Cbask_DivII_Team_Stats_XML.dtd
-.IP \[bu]
-Cbask_DivIII_Team_Stats_XML.dtd
-.IP \[bu]
-CBASK_FGPctXML.dtd
-.IP \[bu]
-CBASK_FoulsXML.dtd
-.IP \[bu]
-CBASK_FTPctXML.dtd
-.IP \[bu]
-Cbask_Indv_Scoring_XML.dtd
-.IP \[bu]
-CBASK_MinutesXML.dtd
-.IP \[bu]
-Cbask_Polls_XML.dtd
-.IP \[bu]
-CBASK_ReboundsXML.dtd
-.IP \[bu]
-CBASK_ScoringLeadersXML.dtd
-.IP \[bu]
-Cbask_Team_ThreePT_Made_XML.dtd
-.IP \[bu]
-Cbask_Team_ThreePT_PCT_XML.dtd
-.IP \[bu]
-Cbask_Team_Win_Pct_XML.dtd
-.IP \[bu]
-Cbask_Top_Twenty_Five_XML.dtd
-.IP \[bu]
-CBASK_TopTwentyFiveResult_XML.dtd
-.IP \[bu]
-Cbask_Tourn_Awards_XML.dtd
-.IP \[bu]
-Cbask_Tourn_Champs_XML.dtd
-.IP \[bu]
-Cbask_Tourn_Indiv_XML.dtd
-.IP \[bu]
-Cbask_Tourn_Leaders_XML.dtd
-.IP \[bu]
-Cbask_Tourn_MVP_XML.dtd
-.IP \[bu]
-Cbask_Tourn_Records_XML.dtd
-.IP \[bu]
-LeagueScheduleXML.dtd
-.IP \[bu]
-minorscoresxml.dtd
-.IP \[bu]
-Minor_Baseball_League_Leaders_XML.dtd
-.IP \[bu]
-Minor_Baseball_Standings_XML.dtd
-.IP \[bu]
-Minor_Baseball_Transactions_XML.dtd
-.IP \[bu]
-mlbbattingavgxml.dtd
-.IP \[bu]
-mlbdoublesleadersxml.dtd
-.IP \[bu]
-MLBGamesPlayedXML.dtd
-.IP \[bu]
-MLBGIDPXML.dtd
-.IP \[bu]
-MLBHitByPitchXML.dtd
-.IP \[bu]
-mlbhitsleadersxml.dtd
-.IP \[bu]
-mlbhomerunsxml.dtd
-.IP \[bu]
-MLBHRFreqXML.dtd
-.IP \[bu]
-MLBIntWalksXML.dtd
-.IP \[bu]
-MLBKORateXML.dtd
-.IP \[bu]
-mlbonbasepctxml.dtd
-.IP \[bu]
-MLBOPSXML.dtd
-.IP \[bu]
-MLBPlateAppsXML.dtd
-.IP \[bu]
-mlbrbisxml.dtd
-.IP \[bu]
-mlbrunsleadersxml.dtd
-.IP \[bu]
-MLBSacFliesXML.dtd
-.IP \[bu]
-MLBSacrificesXML.dtd
-.IP \[bu]
-MLBSBSuccessXML.dtd
-.IP \[bu]
-mlbsluggingpctxml.dtd
-.IP \[bu]
-mlbstandxml.dtd
-.IP \[bu]
-mlbstandxml_preseason.dtd
-.IP \[bu]
-mlbstolenbasexml.dtd
-.IP \[bu]
-mlbtotalbasesleadersxml.dtd
-.IP \[bu]
-mlbtriplesleadersxml.dtd
-.RE
+A list of supported document types is given in the appendix.
 .P
 The GameInfo and SportInfo types do not have their own top-level
 tables in the database. Instead, their raw XML is stored in either the
@@ -281,55 +101,455 @@ tables (game_info and sport_info) still possess timestamps that allow
 us to prune old data.
 .P
 UML diagrams of the resulting database schema for each XML document
-type are provided with the \fBhtsn-import\fR documentation.
+type are provided with the \fBhtsn-import\fR documentation, in the
+\fIdoc/dbschema\fR directory. These are not authoritative, but it
+should be considered a bug if they are incorrect. The diagrams are
+created using the pgModeler <http://www.pgmodeler.com.br/> tool.
+
+.SH DATABASE SCHEMA COMPROMISES
+
+There are a few places where the database schema isn't exactly how we'd
+like it to be:
+
+.IP \[bu] 2
+\fIearlylineXML.dtd\fR
+
+The database representations for earlylineXML.dtd and
+MLB_earlylineXML.dtd are the same; that is, they share the same
+tables. The two document types represent team names in different
+ways. In order to accommodate both types with one parser, we had to
+make both ways optional, and then merge the two together before
+converting to the database representation.
+
+Unfortunately, when we merge two optional things together, we get
+another optional thing back. There's no way to say that \(dqat least
+one is not optional.\(dq So the team names in the database schema are
+optional as well, even though they should always be present.
+
+.SH NULL POLICY
+.P
+Normally in a database one makes a distinction between fields that
+simply don't exist, and those fields that are
+\(dqempty\(dq. Translating from XML, there is a natural way to
+determine which one should be used: if an element is present in the
+XML document but its contents are empty, then an empty string should
+be inserted into the corresponding field. If on the other hand the
+element is missing entirely, the corresponding database entry should
+be NULL to indicate that fact.
+.P
+This sounds well and good, but the XML must be consistent for the
+database consumer to make any sense of what he sees. The feed XML uses
+optional and blank elements interchangeably, and without any
+discernible pattern. To propagate this pattern into the database would
+only cause confusion.
+.P
+As a result, a policy was adopted: both optional elements and elements
+whose contents can be empty will be considered nullable in the
+database. If the element is missing, the corresponding field is
+NULL. The same is true if the element is present but empty. That means
+there should never be a (completely) empty string in a database column.
+
+.SH XML SCHEMA GENERATION
+.P
+In order to parse XML, you need to know the structure of your
+documents. Usually this is given in the form of a DTD or schema. The
+Sports Network does provide DTDs for their XML, but unfortunately many
+of them do not match the XML found on the feed.
+.P
+We need to construct a database into which to insert the XML. How do
+we know if <game> should be a column, or if it should have its own
+table? We need to know how many times it can appear in the
+document. So we need some form of specification. Since the supplied
+DTDs are incorrect, we would like to generate them automatically.
+.P
+The process should go something like,
+.IP 1.
+Generate a DTD from the first foo.xml file we see. Call it foo.dtd.
+.IP 2.
+Validate future foo documents against foo.dtd. If they all validate,
+great. If one fails, add it to the corpus and update foo.dtd so
+that both the original and the new foo.xml validate.
+.IP 3.
+Repeat until no more failures occur. This can never be perfect:
+tomorrow we could get a foo.xml that's wildly different from what
+we've seen in the past. But it's the best we can hope for under
+the circumstances.
+.P
+Enter XML-Schema-learner
+<https://github.com/kore/XML-Schema-learner>. This tool can infer a
+DTD from a set of sample XML files. The top-level \(dqschemagen\(dq
+folder (in this project) contains a number of subfolders\(emone for
+each type of document that we want to parse. Contained therein are XML
+samples for that particular document type. These were hand-picked one
+at a time according to the procedure above, and the complete set of
+XML is what we use to generate the DTDs used by htsn-import.
+.P
+To generate them, run `make schema` at the project
+root. XML-Schema-learner will be invoked on each subfolder of
+\(dqschemagen\(dq and will output the corresponding DTDs to the
+\(dqschemagen\(dq folder.
+.P
+Most of the production schemas are generated this way; however, a few
+needed manual tweaking. The final, believed-to-be-correct schemas for
+all supported document types can be found in the \(dqschema\(dq folder in
+the project root. Having the correct DTDs available means you
+don't need XML-Schema-learner in order to install \fBhtsn-import\fR.
+
+.SH XML SCHEMA UPDATES
+.P
+If a new tag is added to an XML document type, \fBhtsn-import\fR will
+most likely refuse to parse it, since the new documents no longer
+match the existing DTD.
+.P
+The first thing to do in that case is add the unparseable document to
+the \(dqschemagen\(dq directory, and generate a new DTD that matches
+both the old and new samples. Once a new, correct DTD has been
+generated, it should be added to the \(dqschema\(dq directory. Then,
+the parser can be updated and \fBhtsn-import\fR rebuilt.
+.P
+At this point, \fBhtsn-import\fR should be capable of importing the
+new document. But the addition of the new tag will most likely require
+new fields in the database. Fortunately, easy migrations like this are
+handled automatically. As an example, at one point, \fIOdds_XML.dtd\fR
+did not contain the \(dqHStarter\(dq and \(dqAStarter\(dq elements
+associated with its games. Suppose we parse one of the old documents
+(without \(dqHStarter\(dq and \(dqAStarter\(dq) using an old version
+of \fBhtsn-import\fR:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I "              schemagen/Odds_XML/19996433.xml"
+Migration: CREATE TABLE \(dqodds\(dq ...
+Successfully imported schemagen/Odds_XML/19996433.xml.
+Processed 1 document(s) total.
+.fi
+.P
+At this point, the database schema matches the old documents, that is,
+the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+version of \fBhtsn-import\fR, supporting the new fields, the migration
+is handled gracefully:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I "              schemagen/Odds_XML/21315768.xml"
+Migration: ALTER TABLE \(dqodds_games\(dq
+           ADD COLUMN \(dqaway_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+           ADD COLUMN \(dqaway_team_starter_name\(dq VARCHAR;
+Migration: ALTER TABLE \(dqodds_games\(dq
+           ADD COLUMN \(dqhome_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+           ADD COLUMN \(dqhome_team_starter_name\(dq VARCHAR;
+Successfully imported schemagen/Odds_XML/21315768.xml.
+Processed 1 document(s) total.
+.fi
+.P
+If fields are removed from the schema, then manual intervention may be
+necessary:
+.P
+.nf
+.I $ htsn-import -b Postgres -c 'dbname=htsn user=postgres' \\\\
+.I "              schemagen/Odds_XML/19996433.xml"
+ERROR: Database migration: manual intervention required.
+The following actions are considered unsafe:
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_name\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_name\(dq
+
+ERROR: Failed to import file schemagen/Odds_XML/19996433.xml.
+Processed 0 document(s) total.
+.fi
+.P
+To fix these errors, manually invoke the SQL commands that were
+considered unsafe:
+.P
+.nf
+.I $ psql -U postgres -d htsn \\\\
+.I "       -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I "       -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_name;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I "       -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I "       -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_name;'"
+ALTER TABLE
+.fi
+.P
+After manually adjusting the schema, the import should succeed.
 
-.SH XML Schema Oddities
+.SH XML SCHEMA ODDITIES
 .P
 There are a number of problems with the XML on the wire. Even if we
 construct the DTDs ourselves, the results are sometimes
 inconsistent. Here we document a few of them.
 
 .IP \[bu] 2
-Odds_XML.dtd
+\fInewsxml.dtd\fR
+
+The TSN DTD for news (and almost all XML on the wire) suggests that
+there is exactly one (possibly-empty) <SMS> element present in each
+message. However, we have seen an example (XML_File_ID 21232353) where
+an empty <SMS> followed a non-empty one:
+
+.nf
+<SMS>Odd Man Rush: Snow under pressure to improve Isles quickly</SMS>
+<SMS></SMS>
+.fi
+
+We don't parse this case at the moment, but we do recognize it and report
+it as unsupported so that offending documents can be removed. An example
+is provided as test/xml/newsxml-multiple-sms.xml.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+Unlike earlylineXML.dtd, this document type has more than one <game>
+associated with each <date>. Moreover, each <date> has a bunch of
+<note> children that are supposed to be associated with the <game>s,
+but the document structure indicates no explicit relationship. For
+example,
+
+.nf
+<date>
+  <note>...</note>
+  <game>...</game>
+  <game>...</game>
+  <note>...</note>
+  <game>...</game>
+</date>
+.fi
+
+Here the first <note> is inferred to apply to the two <game>s that
+follow it, and the second <note> applies to the single <game> that
+follows it. But this is very fragile to parse. Instead, we use a hack
+to facilitate (un)pickling, and then drop the notes entirely during
+the database conversion.
+
+A similar workaround is implemented for Odds_XML.dtd.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
 
 The <Notes> elements here are supposed to be associated with a set of
 <Game> elements, but since the pair
 (<Notes>...</Notes><Game>...</Game>) can appear zero or more times,
 this leads to ambiguity in parsing. We therefore ignore the notes
-entirely (although a hack is employed to facilitate parsing).
+entirely (although a hack is employed to facilitate parsing). The same
+thing goes for the newer <League_Name> element.
 
 .IP \[bu]
-weatherxml.dtd
+\fIweatherxml.dtd\fR
 
 There appear to be two types of weather documents; the first has
 <listing> contained within <forecast> and the second has <forecast>
 contained within <listing>. While it would be possible to parse both,
 it would greatly complicate things. The first form is more common, so
-that's all we support for now.
+that's all we support for now. An example is provided as
+test/xml/weatherxml-type2.xml.
 
-.SH OPTIONS
+We are, however, able to identify the second type. When one is
+encountered, an informational message (that it is unsupported) will be
+printed. If the \fI\-\-remove\fR flag is used, the file will be
+deleted. This prevents documents that we know we can't import from
+building up.
 
-.IP \fB\-\-backend\fR,\ \fB\-b\fR
-The RDBMS backend to use. Valid choices are \fISqlite\fR and
-\fIPostgres\fR. Capitalization is important, sorry.
+Another problem that comes up occasionally is that the home and away
+team elements appear in the reverse order. As in the other case, we
+report these as unsupported and then \(dqsucceed\(dq so that the
+offending document can be removed if desired. An example is provided
+as test/xml/weatherxml-backwards-teams.xml.
 
-Default: Sqlite
+.SH DATE/TIME ISSUES
 
-.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
-The connection string used for connecting to the database backend
-given by the \fB\-\-backend\fR option. The default is appropriate for
-the \fISqlite\fR backend.
+Dates and times appear in a number of places on the feed. The date
+portions are usually fine, but the times often lack important
+information such as the time zone, or whether \(dq8 o'clock\(dq means
+a.m. or p.m.
 
-Default: \(dq:memory:\(dq
+The most pervasive issue occurs with the timestamps that are included
+in every message. A typical timestamp looks like,
 
-.IP \fB\-\-log-file\fR
-If you specify a file here, logs will be written to it (possibly in
-addition to syslog). Can be either a relative or absolute path. It
-will not be auto-rotated; use something like logrotate for that.
+.nf
+<time_stamp> May 24, 2014, at 04:18 PM ET </time_stamp>
+.fi
 
-Default: none
+The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
+\(dqET\(dq is not a valid time zone. It stands for \(dqEastern
+Time\(dq, which can belong to either of two time zones, EST or EDT,
+based on the time of the year (that is, whether or not daylight
+saving time is in effect) and one's location (for example, Arizona
+doesn't observe daylight saving time). It's not much more useful to
+be off by one hour than it is to be off by five hours, and since we
+can't determine the true offset from the timestamp, we always parse
+and store these as UTC.
 
-.IP \fB\-\-log-level\fR
+Here's a list of the document types whose times may cause surprises:
+
+.IP \[bu] 2
+\fIAutoRacingResultsXML.dtd\fR
+
+The <RaceDate> elements contain a full date and time, but no time zone
+information:
+
+.nf
+<RaceDate>5/24/2014 2:45:00 PM</RaceDate>
+.fi
+
+We parse them as UTC, which will be wrong when stored,
+but \(dqcorrect\(dq if the new UTC time zone is ignored.
+
+.IP \[bu]
+\fIAuto_Racing_Schedule_XML.dtd\fR
+
+The <Race_Date> and <Race_Time> elements are combined into one field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Race_Date>02/16/2013</Race_Date>
+<Race_Time>08:10 PM</Race_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The race times are
+not always available, but when they are missing, the XML presents
+them as \(dqTBA\(dq (to be announced):
+
+.nf
+<Race_Time>TBA</Race_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the race
+date in that case.
+
+.IP \[bu]
+\fIearlylineXML.dtd\fR
+
+The <time> elements in the early lines contain neither a time zone nor
+an am/pm identifier:
+
+.nf
+<time>8:30</time>
+.fi
+
+The times are parsed and stored as UTC, since we
+don't have any other information upon which to base a guess. Even if
+one ignores the UTC time zone, the time can possibly be off by 12
+hours (due to the a.m./p.m. issue).
+
+The game <time> elements can also be empty. Since we store the
+combined game date/time in one field, these games will appear to begin
+at midnight on the day they occur.
+
+.IP \[bu]
+\fIjfilexml.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into one field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>06/15/2014</Game_Date>
+<Game_Time>08:00 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC.
+
+The <CurrentTimeStamp> elements suffer a similar problem, sans the
+date:
+
+.nf
+<CurrentTimeStamp>11:30 A.M.</CurrentTimeStamp>
+.fi
+
+They are also stored as UTC.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+See earlylineXML.dtd.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into one field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>01/04/2014</Game_Date>
+<Game_Time>04:35 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC.
+
+.IP \[bu]
+\fISchedule_Changes_XML.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into one field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>06/06/2014</Game_Date>
+<Game_Time>04:00 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The game times are
+not always available, but when they are missing, the XML presents
+them as \(dqTBA\(dq (to be announced):
+
+.nf
+<Game_Time>TBA</Game_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the game
+date in that case.
+
+.SH DEPLOYMENT
+.P
+When deploying for the first time, the target database will most
+likely be empty. The schema will be migrated when a new document type
+is seen, but this has a downside: it can be months before every
+supported document type has been seen once. This can make it difficult
+to test the database permissions.
+.P
+Since all of the test XML documents have old timestamps, one easy
+workaround is the following: simply import all of the test XML
+documents, and then delete them using whatever script is used to prune
+old entries. This will force the migration of the schema, after which
+you can set and test the database permissions.
+.P
+Something as simple as,
+.P
+.nf
+.I $ find ./test/xml -iname '*.xml' | xargs htsn-import -c foo.sqlite
+.fi
+.P
+should do it.
+
+.SH OPTIONS
+
+.IP \fB\-\-backend\fR,\ \fB\-b\fR
+The RDBMS backend to use. Valid choices are \fISqlite\fR and
+\fIPostgres\fR. Capitalization is important, sorry.
+
+Default: Sqlite
+
+.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
+The connection string used for connecting to the database backend
+given by the \fB\-\-backend\fR option. The default is appropriate for
+the \fISqlite\fR backend.
+
+Default: \(dq:memory:\(dq
+
+.IP \fB\-\-log-file\fR
+If you specify a file here, logs will be written to it (possibly in
+addition to syslog). Can be either a relative or absolute path. It
+will not be auto-rotated; use something like logrotate for that.
+
+Default: none
+
+.IP \fB\-\-log-level\fR
 How verbose should the logs be? We log notifications at four levels:
 DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
 notifications you would like to receive (in all-caps); more
@@ -407,3 +627,372 @@ Imported 1 document(s) total.
 
 .P
 Send bugs to michael@orlitzky.com.
+
+.SH APPENDIX: SUPPORTED DOCUMENT TYPES
+.P
+The XML document types obtained from the feed are uniquely identified
+by their DTDs. We currently support documents with the following DTDs:
+.IP \[bu] 2
+AutoRacingResultsXML.dtd
+.IP \[bu]
+Auto_Racing_Schedule_XML.dtd
+.IP \[bu]
+earlylineXML.dtd
+.IP \[bu]
+Heartbeat.dtd
+.IP \[bu]
+Injuries_Detail_XML.dtd
+.IP \[bu]
+injuriesxml.dtd
+.IP \[bu]
+jfilexml.dtd
+.IP \[bu]
+MLB_earlylineXML.dtd
+.IP \[bu]
+newsxml.dtd
+.IP \[bu]
+Odds_XML.dtd
+.IP \[bu]
+Schedule_Changes_XML.dtd
+.IP \[bu]
+scoresxml.dtd
+.IP \[bu]
+weatherxml.dtd
+.IP \[bu]
+GameInfo
+.RS
+.IP \[bu] 2
+CBASK_Lineup_XML.dtd
+.IP \[bu]
+cbaskpreviewxml.dtd
+.IP \[bu]
+cflpreviewxml.dtd
+.IP \[bu]
+Matchup_NBA_NHL_XML.dtd
+.IP \[bu]
+MLB_Fielding_XML.dtd
+.IP \[bu]
+MLB_Gaming_Matchup_XML.dtd
+.IP \[bu]
+MLB_Lineup_XML.dtd
+.IP \[bu]
+MLB_Matchup_XML.dtd
+.IP \[bu]
+MLS_Preview_XML.dtd
+.IP \[bu]
+mlbpreviewxml.dtd
+.IP \[bu]
+NBA_Gaming_Matchup_XML.dtd
+.IP \[bu]
+NBA_Playoff_Matchup_XML.dtd
+.IP \[bu]
+NBALineupXML.dtd
+.IP \[bu]
+nbapreviewxml.dtd
+.IP \[bu]
+NCAA_FB_Preview_XML.dtd
+.IP \[bu]
+NFL_NCAA_FB_Matchup_XML.dtd
+.IP \[bu]
+nflpreviewxml.dtd
+.IP \[bu]
+nhlpreviewxml.dtd
+.IP \[bu]
+recapxml.dtd
+.IP \[bu]
+WorldBaseballPreviewXML.dtd
+.RE
+.IP \[bu]
+SportInfo
+.RS
+.IP \[bu] 2
+CBASK_3PPctXML.dtd
+.IP \[bu]
+Cbask_All_Tourn_Teams_XML.dtd
+.IP \[bu]
+CBASK_AssistsXML.dtd
+.IP \[bu]
+Cbask_Awards_XML.dtd
+.IP \[bu]
+CBASK_BlocksXML.dtd
+.IP \[bu]
+Cbask_Conf_Standings_XML.dtd
+.IP \[bu]
+Cbask_DivII_III_Indv_Stats_XML.dtd
+.IP \[bu]
+Cbask_DivII_Team_Stats_XML.dtd
+.IP \[bu]
+Cbask_DivIII_Team_Stats_XML.dtd
+.IP \[bu]
+CBASK_FGPctXML.dtd
+.IP \[bu]
+CBASK_FoulsXML.dtd
+.IP \[bu]
+CBASK_FTPctXML.dtd
+.IP \[bu]
+Cbask_Indv_No_Avg_XML.dtd
+.IP \[bu]
+Cbask_Indv_Scoring_XML.dtd
+.IP \[bu]
+Cbask_Indv_Shooting_XML.dtd
+.IP \[bu]
+CBASK_MinutesXML.dtd
+.IP \[bu]
+Cbask_Polls_XML.dtd
+.IP \[bu]
+CBASK_ReboundsXML.dtd
+.IP \[bu]
+CBASK_ScoringLeadersXML.dtd
+.IP \[bu]
+Cbask_Team_Scoring_Rebound_Margin_XML.dtd
+.IP \[bu]
+Cbask_Team_Scoring_XML.dtd
+.IP \[bu]
+Cbask_Team_Shooting_Pct_XML.dtd
+.IP \[bu]
+Cbask_Team_ThreePT_Made_XML.dtd
+.IP \[bu]
+Cbask_Team_ThreePT_PCT_XML.dtd
+.IP \[bu]
+Cbask_Team_Win_Pct_XML.dtd
+.IP \[bu]
+Cbask_Top_Twenty_Five_XML.dtd
+.IP \[bu]
+CBASK_TopTwentyFiveResult_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Awards_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Champs_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Indiv_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Leaders_XML.dtd
+.IP \[bu]
+Cbask_Tourn_MVP_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Records_XML.dtd
+.IP \[bu]
+LeagueScheduleXML.dtd
+.IP \[bu]
+minorscoresxml.dtd
+.IP \[bu]
+Minor_Baseball_League_Leaders_XML.dtd
+.IP \[bu]
+Minor_Baseball_Standings_XML.dtd
+.IP \[bu]
+Minor_Baseball_Transactions_XML.dtd
+.IP \[bu]
+mlbbattingavgxml.dtd
+.IP \[bu]
+mlbdoublesleadersxml.dtd
+.IP \[bu]
+MLBGamesPlayedXML.dtd
+.IP \[bu]
+MLBGIDPXML.dtd
+.IP \[bu]
+MLBHitByPitchXML.dtd
+.IP \[bu]
+mlbhitsleadersxml.dtd
+.IP \[bu]
+mlbhomerunsxml.dtd
+.IP \[bu]
+MLBHRFreqXML.dtd
+.IP \[bu]
+MLBIntWalksXML.dtd
+.IP \[bu]
+MLBKORateXML.dtd
+.IP \[bu]
+mlbonbasepctxml.dtd
+.IP \[bu]
+MLBOPSXML.dtd
+.IP \[bu]
+MLBPlateAppsXML.dtd
+.IP \[bu]
+mlbrbisxml.dtd
+.IP \[bu]
+mlbrunsleadersxml.dtd
+.IP \[bu]
+MLBSacFliesXML.dtd
+.IP \[bu]
+MLBSacrificesXML.dtd
+.IP \[bu]
+MLBSBSuccessXML.dtd
+.IP \[bu]
+mlbsluggingpctxml.dtd
+.IP \[bu]
+mlbstandxml.dtd
+.IP \[bu]
+mlbstandxml_preseason.dtd
+.IP \[bu]
+mlbstolenbasexml.dtd
+.IP \[bu]
+mlbtotalbasesleadersxml.dtd
+.IP \[bu]
+mlbtriplesleadersxml.dtd
+.IP \[bu]
+MLBWalkRateXML.dtd
+.IP \[bu]
+mlbwalksleadersxml.dtd
+.IP \[bu]
+MLBXtraBaseHitsXML.dtd
+.IP \[bu]
+MLB_Pitching_Appearances_Leaders.dtd
+.IP \[bu]
+MLB_ERA_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Balks_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_CG_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_ER_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Hits_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Hit_Batters_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_HR_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_IP_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Runs_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Saves_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Shut_Outs_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Starts_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Strike_Outs_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Walks_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_WHIP_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Wild_Pitches_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Win_Percentage_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_WL_Leaders.dtd
+.IP \[bu]
+NBA_Team_Stats_XML.dtd
+.IP \[bu]
+NBA3PPctXML.dtd
+.IP \[bu]
+NBAAssistsXML.dtd
+.IP \[bu]
+NBABlocksXML.dtd
+.IP \[bu]
+nbaconfrecxml.dtd
+.IP \[bu]
+nbadaysxml.dtd
+.IP \[bu]
+nbadivisionsxml.dtd
+.IP \[bu]
+NBAFGPctXML.dtd
+.IP \[bu]
+NBAFoulsXML.dtd
+.IP \[bu]
+NBAFTPctXML.dtd
+.IP \[bu]
+NBAMinutesXML.dtd
+.IP \[bu]
+NBAReboundsXML.dtd
+.IP \[bu]
+NBAScorersXML.dtd
+.IP \[bu]
+nbastandxml.dtd
+.IP \[bu]
+NBAStealsXML.dtd
+.IP \[bu]
+nbateamleadersxml.dtd
+.IP \[bu]
+nbatripledoublexml.dtd
+.IP \[bu]
+NBATurnoversXML.dtd
+.IP \[bu]
+NCAA_Conference_Schedule_XML.dtd
+.IP \[bu]
+nflfirstdownxml.dtd
+.IP \[bu]
+NFLFumbleLeaderXML.dtd
+.IP \[bu]
+NFLGiveTakeXML.dtd
+.IP \[bu]
+NFLGrassTurfDomeOutsideXML.dtd
+.IP \[bu]
+NFLInside20XML.dtd
+.IP \[bu]
+NFLInterceptionLeadersXML.dtd
+.IP \[bu]
+NFLKickoffsXML.dtd
+.IP \[bu]
+NFLMondayNightXML.dtd
+.IP \[bu]
+NFLPassingLeadersXML.dtd
+.IP \[bu]
+NFLPassLeadXML.dtd
+.IP \[bu]
+NFLQBStartsXML.dtd
+.IP \[bu]
+NFLReceivingLeadersXML.dtd
+.IP \[bu]
+NFLRushingLeadersXML.dtd
+.IP \[bu]
+NFLSackLeadersXML.dtd
+.IP \[bu]
+nflstandxml.dtd
+.IP \[bu]
+NFLTackleFFLeadersXML.dtd
+.IP \[bu]
+NFLTeamRankingsXML.dtd
+.IP \[bu]
+NFLTopKickoffReturnXML.dtd
+.IP \[bu]
+NFLTopPerformanceXML.dtd
+.IP \[bu]
+NFLTopPuntReturnXML.dtd
+.IP \[bu]
+NFLTotalYardageXML.dtd
+.IP \[bu]
+NFLYardsXML.dtd
+.IP \[bu]
+NFL_KickingLeaders_XML.dtd
+.IP \[bu]
+NFL_NBA_Draft_XML.dtd
+.IP \[bu]
+NFL_PuntingLeaders_XML.dtd
+.IP \[bu]
+NFL_Roster_XML.dtd
+.IP \[bu]
+NFL_Team_Stats_XML.dtd
+.IP \[bu]
+Transactions_XML.dtd
+.IP \[bu]
+Weekly_Sched_XML.dtd
+.IP \[bu]
+WNBA_Team_Leaders_XML.dtd
+.IP \[bu]
+WNBA3PPctXML.dtd
+.IP \[bu]
+WNBAAssistsXML.dtd
+.IP \[bu]
+WNBABlocksXML.dtd
+.IP \[bu]
+WNBAFGPctXML.dtd
+.IP \[bu]
+WNBAFoulsXML.dtd
+.IP \[bu]
+WNBAFTPctXML.dtd
+.IP \[bu]
+WNBAMinutesXML.dtd
+.IP \[bu]
+WNBAReboundsXML.dtd
+.IP \[bu]
+WNBAScorersXML.dtd
+.IP \[bu]
+wnbastandxml.dtd
+.IP \[bu]
+WNBAStealsXML.dtd
+.IP \[bu]
+WNBATurnoversXML.dtd
+.RE