X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=256dfd5980a3cc015d90e734998b366fff759cc2;hb=1f90ca373a3239b4903d663c021f0e9f2616d379;hp=0816be35d33cc5eb8a035dc551e40d7e05f1b44b;hpb=ec6c5b56f8e3096786e8f0a0d3c5c3c1610b69f7;p=dead%2Fhtsn-import.git

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 0816be3..256dfd5 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -8,26 +8,428 @@ htsn-import \- Import XML files from The Sports Network into an RDBMS.
 \fBhtsn-import\fR [OPTIONS] [FILES]
 
 .SH DESCRIPTION
+.P
+The Sports Network <http://www.sportsnetwork.com/> offers an XML feed
+containing various sports news and statistics. Our sister program
+\fBhtsn\fR is capable of retrieving the feed and saving the individual
+XML documents contained therein. But what to do with them?
+.P
+The purpose of \fBhtsn-import\fR is to take these XML documents and
+get them into something we can use, a relational database management
+system (RDBMS), otherwise known as a SQL database. The structure of
+relational database, is, well, relational, and the feed XML is not. So
+there is some work to do before the data can be imported into the
+database.
+.P
+First, we must parse the XML. Each supported document type (see below)
+has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
+synonym for serialize here). That means that we parse the entire
+document into a data structure, and if we pickle (serialize) that data
+structure, we get the exact same XML document tha we started with.
+.P
+This is important for two reasons. First, it serves as a second level
+of validation. The first validation is performed by the XML parser,
+but if that succeeds and unpicking fails, we know that something is
+fishy. Second, we don't ever want to be surprised by some new element
+or attribute showing up in the XML. The fact that we can unpickle the
+whole thing now means that we won't be surprised in the future.
+.P
+The aforementioned feature is especially important because we
+automatically migrate the database schema every time we import a
+document. If you attempt to import a \(dqnewsxml.dtd\(dq document, all
+database objects relating to the news will be created if they do not
+exist. We don't want the schema to change out from under us without
+warning, so it's important that no XML be parsed that would result in
+a different schema than we had previously. Since we can
+pickle/unpickle everything already, this should be impossible.
+
+.SH SUPPORTED DOCUMENT TYPES
+.P
+The XML document types obtained from the feed are uniquely identified
+by their DTDs. We currently support documents with the following DTDs:
+.IP \[bu] 2
+AutoRacingResultsXML.dtd
+.IP \[bu]
+Auto_Racing_Schedule_XML.dtd
+.IP \[bu]
+Heartbeat.dtd
+.IP \[bu]
+Injuries_Detail_XML.dtd
+.IP \[bu]
+injuriesxml.dtd
+.IP \[bu]
+newsxml.dtd
+.IP \[bu]
+Odds_XML.dtd
+.IP \[bu]
+scoresxml.dtd
+.IP \[bu]
+weatherxml.dtd
+.IP \[bu]
+GameInfo
+.RS
+.IP \[bu]
+CBASK_Lineup_XML.dtd
+.IP \[bu]
+cbaskpreviewxml.dtd
+.IP \[bu]
+cflpreviewxml.dtd
+.IP \[bu]
+Matchup_NBA_NHL_XML.dtd
+.IP \[bu]
+MLB_Gaming_Matchup_XML.dtd
+.IP \[bu]
+MLB_Lineup_XML.dtd
+.IP \[bu]
+MLB_Matchup_XML.dtd
+.IP \[bu]
+MLS_Preview_XML.dtd
+.IP \[bu]
+mlbpreviewxml.dtd
+.IP \[bu]
+NBA_Gaming_Matchup_XML.dtd
+.IP \[bu]
+NBA_Playoff_Matchup_XML.dtd
+.IP \[bu]
+NBALineupXML.dtd
+.IP \[bu]
+nbapreviewxml.dtd
+.IP \[bu]
+NCAA_FB_Preview_XML.dtd
+.IP \[bu]
+NFL_NCAA_FB_Matchup_XML.dtd
+.IP \[bu]
+nflpreviewxml.dtd
+.IP \[bu]
+nhlpreviewxml.dtd
+.IP \[bu]
+recapxml.dtd
+.IP \[bu]
+WorldBaseballPreviewXML.dtd
+.RE
+.IP \[bu]
+SportInfo
+.RS
+.IP \[bu]
+CBASK_3PPctXML.dtd
+.IP \[bu]
+Cbask_All_Tourn_Teams_XML.dtd
+.IP \[bu]
+CBASK_AssistsXML.dtd
+.IP \[bu]
+Cbask_Awards_XML.dtd
+.IP \[bu]
+CBASK_BlocksXML.dtd
+.IP \[bu]
+Cbask_Conf_Standings_XML.dtd
+.IP \[bu]
+Cbask_DivII_III_Indv_Stats_XML.dtd
+.IP \[bu]
+Cbask_DivII_Team_Stats_XML.dtd
+.IP \[bu]
+Cbask_DivIII_Team_Stats_XML.dtd
+.IP \[bu]
+CBASK_FGPctXML.dtd
+.IP \[bu]
+CBASK_FoulsXML.dtd
+.IP \[bu]
+CBASK_FTPctXML.dtd
+.IP \[bu]
+Cbask_Indv_Scoring_XML.dtd
+.IP \[bu]
+CBASK_MinutesXML.dtd
+.IP \[bu]
+Cbask_Polls_XML.dtd
+.IP \[bu]
+CBASK_ReboundsXML.dtd
+.IP \[bu]
+CBASK_ScoringLeadersXML.dtd
+.IP \[bu]
+Cbask_Team_ThreePT_Made_XML.dtd
+.IP \[bu]
+Cbask_Team_ThreePT_PCT_XML.dtd
+.IP \[bu]
+Cbask_Team_Win_Pct_XML.dtd
+.IP \[bu]
+Cbask_Top_Twenty_Five_XML.dtd
+.IP \[bu]
+CBASK_TopTwentyFiveResult_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Awards_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Champs_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Indiv_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Leaders_XML.dtd
+.IP \[bu]
+Cbask_Tourn_MVP_XML.dtd
+.IP \[bu]
+Cbask_Tourn_Records_XML.dtd
+.IP \[bu]
+LeagueScheduleXML.dtd
+.IP \[bu]
+minorscoresxml.dtd
+.IP \[bu]
+Minor_Baseball_League_Leaders_XML.dtd
+.IP \[bu]
+Minor_Baseball_Standings_XML.dtd
+.IP \[bu]
+Minor_Baseball_Transactions_XML.dtd
+.IP \[bu]
+mlbbattingavgxml.dtd
+.IP \[bu]
+mlbdoublesleadersxml.dtd
+.IP \[bu]
+MLBGamesPlayedXML.dtd
+.IP \[bu]
+MLBGIDPXML.dtd
+.IP \[bu]
+MLBHitByPitchXML.dtd
+.IP \[bu]
+mlbhitsleadersxml.dtd
+.IP \[bu]
+mlbhomerunsxml.dtd
+.IP \[bu]
+MLBHRFreqXML.dtd
+.IP \[bu]
+MLBIntWalksXML.dtd
+.IP \[bu]
+MLBKORateXML.dtd
+.IP \[bu]
+mlbonbasepctxml.dtd
+.IP \[bu]
+MLBOPSXML.dtd
+.IP \[bu]
+MLBPlateAppsXML.dtd
+.IP \[bu]
+mlbrbisxml.dtd
+.IP \[bu]
+mlbrunsleadersxml.dtd
+.IP \[bu]
+MLBSacFliesXML.dtd
+.IP \[bu]
+MLBSacrificesXML.dtd
+.IP \[bu]
+MLBSBSuccessXML.dtd
+.IP \[bu]
+mlbsluggingpctxml.dtd
+.IP \[bu]
+mlbstandxml.dtd
+.IP \[bu]
+mlbstandxml_preseason.dtd
+.IP \[bu]
+mlbstolenbasexml.dtd
+.IP \[bu]
+mlbtotalbasesleadersxml.dtd
+.IP \[bu]
+mlbtriplesleadersxml.dtd
+.IP \[bu]
+MLBWalkRateXML.dtd
+.IP \[bu]
+mlbwalksleadersxml.dtd
+.IP \[bu]
+MLBXtraBaseHitsXML.dtd
+.IP \[bu]
+MLB_Pitching_Appearances_Leaders.dtd
+.IP \[bu]
+MLB_ERA_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Balks_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_CG_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_ER_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Hits_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Hit_Batters_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_HR_Allowed_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_IP_Leaders.dtd
+.IP \[bu]
+MLB_Pitching_Runs_Allowed_Leaders.dtd
+.RE
+.P
+The GameInfo and SportInfo types do not have their own top-level
+tables in the database. Instead, their raw XML is stored in either the
+\(dqgame_info\(dq or \(dqsport_info\(dq table respectively.
 
 .SH DATABASE SCHEMA
 .P
-At the top level, we have one table for each of the XML document types
-that we import. For example, the documents corresponding to
-\fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
+At the top level (with two notable exceptions), we have one table for
+each of the XML document types that we import. For example, the
+documents corresponding to \fInewsxml.dtd\fR will have a table called
+\(dqnews\(dq. All top-level tables contain two important fields,
+\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and
+prevents us from inserting the same data twice. The time stamp on the
+other hand lets us know when the data is old and can be removed. The
+database schema make it possible to delete only the outdated top-level
+records; all transient children should be removed by triggers.
 .P
 These top-level tables will often have children. For example, each
 news item has zero or more locations associated with it. The child
 table will be named <parent>_<children>, which in this case
-corresponsds to \(dqnews_locations\(dq.
-.P
-To relate the two, a third table exists with name <parent
-table>__<child table>. Note the two underscores. This prevents
-ambiguity when the child table itself contains underscores. As long we
-never go more than one level down, this system should suffice. The
-table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
-\(dqnews__news_locations\(dq.
-.P
-Wherever possible, children are kept unique to prevent pointless
-duplication. This slows down inserts, and speeds up reads (which we
-assume are much more frequent). The current rate at which the feed
-transmits XML is much too slow to cause problems inserting.
+corresponds to \(dqnews_locations\(dq.
+.P
+To relate the two, a third table may exist with name
+<parent>__<child>. Note the two underscores. This prevents ambiguity
+when the child table itself contains underscores. The table joining
+\(dqnews\(dq with \(dqnews_locations\(dq is thus called
+\(dqnews__news_locations\(dq. This is necessary when the child table
+has a unique constraint; we don't want to blindly insert duplicate
+records keyed to the parent. Instead we'd like to use the third table
+to map an existing child to the new parent.
+.P
+Where it makes sense, children are kept unique to prevent pointless
+duplication. This slows down inserts, and speeds up reads (which are
+much more frequent). There is a tradeoff to be made, however. For a
+table with a small, fixed upper bound on the number of rows (like
+\(dqodds_casinos\(dq), there is great benefit to de-duplication. The
+total number of rows stays small, so inserts are still quick, and many
+duplicate rows are eliminated.
+.P
+But, with a table like \(dqodds_games\(dq, the number of games grows
+quickly and without bound. It is therefore more beneficial to be able
+to delete the old games (through an ON DELETE CASCADE, tied to
+\(dqodds\(dq) than it is to eliminate duplication. A table like
+\(dqnews_locations\(dq is somewhere in-between. It is hoped that the
+unique constraint in the top-level table's \(dqxml_file_id\(dq will
+prevent duplication in this case anyway.
+.P
+The aforementioned exceptions are the \(dqgame_info\(dq and
+\(dqsport_info\(dq tables. These tables contain the raw XML for a
+number of DTDs that are not handled individually. This is partially
+for backwards-compatibility with a legacy implementation, but is
+mostly a stopgap due to a lack of resources at the moment. These two
+tables (game_info and sport_info) still possess timestamps that allow
+us to prune old data.
+.P
+UML diagrams of the resulting database schema for each XML document
+type are provided with the \fBhtsn-import\fR documentation.
+
+.SH XML Schema Oddities
+.P
+There are a number of problems with the XML on the wire. Even if we
+construct the DTDs ourselves, the results are sometimes
+inconsistent. Here we document a few of them.
+
+.IP \[bu] 2
+Odds_XML.dtd
+
+The <Notes> elements here are supposed to be associated with a set of
+<Game> elements, but since the pair
+(<Notes>...</Notes><Game>...</Game>) can appear zero or more times,
+this leads to ambiguity in parsing. We therefore ignore the notes
+entirely (although a hack is employed to facilitate parsing).
+
+.IP \[bu]
+weatherxml.dtd
+
+There appear to be two types of weather documents; the first has
+<listing> contained within <forecast> and the second has <forecast>
+contained within <listing>. While it would be possible to parse both,
+it would greatly complicate things. The first form is more common, so
+that's all we support for now.
+
+.SH OPTIONS
+
+.IP \fB\-\-backend\fR,\ \fB\-b\fR
+The RDBMS backend to use. Valid choices are \fISqlite\fR and
+\fIPostgres\fR. Capitalization is important, sorry.
+
+Default: Sqlite
+
+.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
+The connection string used for connecting to the database backend
+given by the \fB\-\-backend\fR option. The default is appropriate for
+the \fISqlite\fR backend.
+
+Default: \(dq:memory:\(dq
+
+.IP \fB\-\-log-file\fR
+If you specify a file here, logs will be written to it (possibly in
+addition to syslog). Can be either a relative or absolute path. It
+will not be auto-rotated; use something like logrotate for that.
+
+Default: none
+
+.IP \fB\-\-log-level\fR
+How verbose should the logs be? We log notifications at four levels:
+DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
+notifications you would like to receive (in all-caps); more
+interesting notifications will be logged as well. The debug output is
+extremely verbose and will not be written to syslog even if you try.
+
+Default: INFO
+
+.IP \fB\-\-remove\fR,\ \fB\-r\fR
+Remove successfully processed files. If you enable this, you can see
+at a glance which XML files are not being processed, because they're
+all that should be left.
+
+Default: disabled
+
+.IP \fB\-\-syslog\fR,\ \fB\-s\fR
+Enable logging to syslog. On Windows this will attempt to communicate
+(over UDP) with a syslog daemon on localhost, which will most likely
+not work.
+
+Default: disabled
+
+.SH CONFIGURATION FILE
+.P
+Any of the command-line options mentioned above can be specified in a
+configuration file instead. We first look for \(dqhtsn-importrc\(dq in
+the system configuration directory. We then look for a file named
+\(dq.htsn-importrc\(dq in the user's home directory. The latter will
+override the former.
+.P
+The user's home directory is simply $HOME on Unix; on Windows it's
+wherever %APPDATA% points. The system configuration directory is
+determined by Cabal; the \(dqsysconfdir\(dq parameter during the
+\(dqconfigure\(dq step is used.
+.P
+The file's syntax is given by examples in the htsn-importrc.example file
+(included with \fBhtsn-import\fR).
+.P
+Options specified on the command-line override those in either
+configuration file.
+
+.SH EXAMPLES
+.IP \[bu] 2
+Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq:
+
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I "              test/xml/newsxml.xml"
+Successfully imported test/xml/newsxml.xml.
+Imported 1 document(s) total.
+.fi
+.IP \[bu]
+Repeat the previous example, but delete newsxml.xml afterwards:
+
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I "              --remove test/xml/newsxml.xml"
+Successfully imported test/xml/newsxml.xml.
+Imported 1 document(s) total.
+Removed processed file test/xml/newsxml.xml.
+.fi
+.IP \[bu]
+Use a Postgres database instead of the default Sqlite. This assumes
+that you have a database named \(dqhtsn\(dq accessible to user
+\(dqpostgres\(dq locally:
+
+.nf
+.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\
+.I "              --backend=Postgres test/xml/newsxml.xml"
+Successfully imported test/xml/newsxml.xml.
+Imported 1 document(s) total.
+.fi
+
+.SH BUGS
+
+.P
+Send bugs to michael@orlitzky.com.