X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=256dfd5980a3cc015d90e734998b366fff759cc2;hb=1f90ca373a3239b4903d663c021f0e9f2616d379;hp=0816be35d33cc5eb8a035dc551e40d7e05f1b44b;hpb=ec6c5b56f8e3096786e8f0a0d3c5c3c1610b69f7;p=dead%2Fhtsn-import.git diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1 index 0816be3..256dfd5 100644 --- a/doc/man1/htsn-import.1 +++ b/doc/man1/htsn-import.1 @@ -8,26 +8,428 @@ htsn-import \- Import XML files from The Sports Network into an RDBMS. \fBhtsn-import\fR [OPTIONS] [FILES] .SH DESCRIPTION +.P +The Sports Network offers an XML feed +containing various sports news and statistics. Our sister program +\fBhtsn\fR is capable of retrieving the feed and saving the individual +XML documents contained therein. But what to do with them? +.P +The purpose of \fBhtsn-import\fR is to take these XML documents and +get them into something we can use, a relational database management +system (RDBMS), otherwise known as a SQL database. The structure of +relational database, is, well, relational, and the feed XML is not. So +there is some work to do before the data can be imported into the +database. +.P +First, we must parse the XML. Each supported document type (see below) +has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a +synonym for serialize here). That means that we parse the entire +document into a data structure, and if we pickle (serialize) that data +structure, we get the exact same XML document tha we started with. +.P +This is important for two reasons. First, it serves as a second level +of validation. The first validation is performed by the XML parser, +but if that succeeds and unpicking fails, we know that something is +fishy. Second, we don't ever want to be surprised by some new element +or attribute showing up in the XML. The fact that we can unpickle the +whole thing now means that we won't be surprised in the future. +.P +The aforementioned feature is especially important because we +automatically migrate the database schema every time we import a +document. If you attempt to import a \(dqnewsxml.dtd\(dq document, all +database objects relating to the news will be created if they do not +exist. We don't want the schema to change out from under us without +warning, so it's important that no XML be parsed that would result in +a different schema than we had previously. Since we can +pickle/unpickle everything already, this should be impossible. + +.SH SUPPORTED DOCUMENT TYPES +.P +The XML document types obtained from the feed are uniquely identified +by their DTDs. We currently support documents with the following DTDs: +.IP \[bu] 2 +AutoRacingResultsXML.dtd +.IP \[bu] +Auto_Racing_Schedule_XML.dtd +.IP \[bu] +Heartbeat.dtd +.IP \[bu] +Injuries_Detail_XML.dtd +.IP \[bu] +injuriesxml.dtd +.IP \[bu] +newsxml.dtd +.IP \[bu] +Odds_XML.dtd +.IP \[bu] +scoresxml.dtd +.IP \[bu] +weatherxml.dtd +.IP \[bu] +GameInfo +.RS +.IP \[bu] +CBASK_Lineup_XML.dtd +.IP \[bu] +cbaskpreviewxml.dtd +.IP \[bu] +cflpreviewxml.dtd +.IP \[bu] +Matchup_NBA_NHL_XML.dtd +.IP \[bu] +MLB_Gaming_Matchup_XML.dtd +.IP \[bu] +MLB_Lineup_XML.dtd +.IP \[bu] +MLB_Matchup_XML.dtd +.IP \[bu] +MLS_Preview_XML.dtd +.IP \[bu] +mlbpreviewxml.dtd +.IP \[bu] +NBA_Gaming_Matchup_XML.dtd +.IP \[bu] +NBA_Playoff_Matchup_XML.dtd +.IP \[bu] +NBALineupXML.dtd +.IP \[bu] +nbapreviewxml.dtd +.IP \[bu] +NCAA_FB_Preview_XML.dtd +.IP \[bu] +NFL_NCAA_FB_Matchup_XML.dtd +.IP \[bu] +nflpreviewxml.dtd +.IP \[bu] +nhlpreviewxml.dtd +.IP \[bu] +recapxml.dtd +.IP \[bu] +WorldBaseballPreviewXML.dtd +.RE +.IP \[bu] +SportInfo +.RS +.IP \[bu] +CBASK_3PPctXML.dtd +.IP \[bu] +Cbask_All_Tourn_Teams_XML.dtd +.IP \[bu] +CBASK_AssistsXML.dtd +.IP \[bu] +Cbask_Awards_XML.dtd +.IP \[bu] +CBASK_BlocksXML.dtd +.IP \[bu] +Cbask_Conf_Standings_XML.dtd +.IP \[bu] +Cbask_DivII_III_Indv_Stats_XML.dtd +.IP \[bu] +Cbask_DivII_Team_Stats_XML.dtd +.IP \[bu] +Cbask_DivIII_Team_Stats_XML.dtd +.IP \[bu] +CBASK_FGPctXML.dtd +.IP \[bu] +CBASK_FoulsXML.dtd +.IP \[bu] +CBASK_FTPctXML.dtd +.IP \[bu] +Cbask_Indv_Scoring_XML.dtd +.IP \[bu] +CBASK_MinutesXML.dtd +.IP \[bu] +Cbask_Polls_XML.dtd +.IP \[bu] +CBASK_ReboundsXML.dtd +.IP \[bu] +CBASK_ScoringLeadersXML.dtd +.IP \[bu] +Cbask_Team_ThreePT_Made_XML.dtd +.IP \[bu] +Cbask_Team_ThreePT_PCT_XML.dtd +.IP \[bu] +Cbask_Team_Win_Pct_XML.dtd +.IP \[bu] +Cbask_Top_Twenty_Five_XML.dtd +.IP \[bu] +CBASK_TopTwentyFiveResult_XML.dtd +.IP \[bu] +Cbask_Tourn_Awards_XML.dtd +.IP \[bu] +Cbask_Tourn_Champs_XML.dtd +.IP \[bu] +Cbask_Tourn_Indiv_XML.dtd +.IP \[bu] +Cbask_Tourn_Leaders_XML.dtd +.IP \[bu] +Cbask_Tourn_MVP_XML.dtd +.IP \[bu] +Cbask_Tourn_Records_XML.dtd +.IP \[bu] +LeagueScheduleXML.dtd +.IP \[bu] +minorscoresxml.dtd +.IP \[bu] +Minor_Baseball_League_Leaders_XML.dtd +.IP \[bu] +Minor_Baseball_Standings_XML.dtd +.IP \[bu] +Minor_Baseball_Transactions_XML.dtd +.IP \[bu] +mlbbattingavgxml.dtd +.IP \[bu] +mlbdoublesleadersxml.dtd +.IP \[bu] +MLBGamesPlayedXML.dtd +.IP \[bu] +MLBGIDPXML.dtd +.IP \[bu] +MLBHitByPitchXML.dtd +.IP \[bu] +mlbhitsleadersxml.dtd +.IP \[bu] +mlbhomerunsxml.dtd +.IP \[bu] +MLBHRFreqXML.dtd +.IP \[bu] +MLBIntWalksXML.dtd +.IP \[bu] +MLBKORateXML.dtd +.IP \[bu] +mlbonbasepctxml.dtd +.IP \[bu] +MLBOPSXML.dtd +.IP \[bu] +MLBPlateAppsXML.dtd +.IP \[bu] +mlbrbisxml.dtd +.IP \[bu] +mlbrunsleadersxml.dtd +.IP \[bu] +MLBSacFliesXML.dtd +.IP \[bu] +MLBSacrificesXML.dtd +.IP \[bu] +MLBSBSuccessXML.dtd +.IP \[bu] +mlbsluggingpctxml.dtd +.IP \[bu] +mlbstandxml.dtd +.IP \[bu] +mlbstandxml_preseason.dtd +.IP \[bu] +mlbstolenbasexml.dtd +.IP \[bu] +mlbtotalbasesleadersxml.dtd +.IP \[bu] +mlbtriplesleadersxml.dtd +.IP \[bu] +MLBWalkRateXML.dtd +.IP \[bu] +mlbwalksleadersxml.dtd +.IP \[bu] +MLBXtraBaseHitsXML.dtd +.IP \[bu] +MLB_Pitching_Appearances_Leaders.dtd +.IP \[bu] +MLB_ERA_Leaders.dtd +.IP \[bu] +MLB_Pitching_Balks_Leaders.dtd +.IP \[bu] +MLB_Pitching_CG_Leaders.dtd +.IP \[bu] +MLB_Pitching_ER_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_Hits_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_Hit_Batters_Leaders.dtd +.IP \[bu] +MLB_Pitching_HR_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_IP_Leaders.dtd +.IP \[bu] +MLB_Pitching_Runs_Allowed_Leaders.dtd +.RE +.P +The GameInfo and SportInfo types do not have their own top-level +tables in the database. Instead, their raw XML is stored in either the +\(dqgame_info\(dq or \(dqsport_info\(dq table respectively. .SH DATABASE SCHEMA .P -At the top level, we have one table for each of the XML document types -that we import. For example, the documents corresponding to -\fInewsxml.dtd\fR will have a table called \(dqnews\(dq. +At the top level (with two notable exceptions), we have one table for +each of the XML document types that we import. For example, the +documents corresponding to \fInewsxml.dtd\fR will have a table called +\(dqnews\(dq. All top-level tables contain two important fields, +\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and +prevents us from inserting the same data twice. The time stamp on the +other hand lets us know when the data is old and can be removed. The +database schema make it possible to delete only the outdated top-level +records; all transient children should be removed by triggers. .P These top-level tables will often have children. For example, each news item has zero or more locations associated with it. The child table will be named _, which in this case -corresponsds to \(dqnews_locations\(dq. -.P -To relate the two, a third table exists with name __. Note the two underscores. This prevents -ambiguity when the child table itself contains underscores. As long we -never go more than one level down, this system should suffice. The -table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called -\(dqnews__news_locations\(dq. -.P -Wherever possible, children are kept unique to prevent pointless -duplication. This slows down inserts, and speeds up reads (which we -assume are much more frequent). The current rate at which the feed -transmits XML is much too slow to cause problems inserting. +corresponds to \(dqnews_locations\(dq. +.P +To relate the two, a third table may exist with name +__. Note the two underscores. This prevents ambiguity +when the child table itself contains underscores. The table joining +\(dqnews\(dq with \(dqnews_locations\(dq is thus called +\(dqnews__news_locations\(dq. This is necessary when the child table +has a unique constraint; we don't want to blindly insert duplicate +records keyed to the parent. Instead we'd like to use the third table +to map an existing child to the new parent. +.P +Where it makes sense, children are kept unique to prevent pointless +duplication. This slows down inserts, and speeds up reads (which are +much more frequent). There is a tradeoff to be made, however. For a +table with a small, fixed upper bound on the number of rows (like +\(dqodds_casinos\(dq), there is great benefit to de-duplication. The +total number of rows stays small, so inserts are still quick, and many +duplicate rows are eliminated. +.P +But, with a table like \(dqodds_games\(dq, the number of games grows +quickly and without bound. It is therefore more beneficial to be able +to delete the old games (through an ON DELETE CASCADE, tied to +\(dqodds\(dq) than it is to eliminate duplication. A table like +\(dqnews_locations\(dq is somewhere in-between. It is hoped that the +unique constraint in the top-level table's \(dqxml_file_id\(dq will +prevent duplication in this case anyway. +.P +The aforementioned exceptions are the \(dqgame_info\(dq and +\(dqsport_info\(dq tables. These tables contain the raw XML for a +number of DTDs that are not handled individually. This is partially +for backwards-compatibility with a legacy implementation, but is +mostly a stopgap due to a lack of resources at the moment. These two +tables (game_info and sport_info) still possess timestamps that allow +us to prune old data. +.P +UML diagrams of the resulting database schema for each XML document +type are provided with the \fBhtsn-import\fR documentation. + +.SH XML Schema Oddities +.P +There are a number of problems with the XML on the wire. Even if we +construct the DTDs ourselves, the results are sometimes +inconsistent. Here we document a few of them. + +.IP \[bu] 2 +Odds_XML.dtd + +The elements here are supposed to be associated with a set of + elements, but since the pair +(......) can appear zero or more times, +this leads to ambiguity in parsing. We therefore ignore the notes +entirely (although a hack is employed to facilitate parsing). + +.IP \[bu] +weatherxml.dtd + +There appear to be two types of weather documents; the first has + contained within and the second has +contained within . While it would be possible to parse both, +it would greatly complicate things. The first form is more common, so +that's all we support for now. + +.SH OPTIONS + +.IP \fB\-\-backend\fR,\ \fB\-b\fR +The RDBMS backend to use. Valid choices are \fISqlite\fR and +\fIPostgres\fR. Capitalization is important, sorry. + +Default: Sqlite + +.IP \fB\-\-connection-string\fR,\ \fB\-c\fR +The connection string used for connecting to the database backend +given by the \fB\-\-backend\fR option. The default is appropriate for +the \fISqlite\fR backend. + +Default: \(dq:memory:\(dq + +.IP \fB\-\-log-file\fR +If you specify a file here, logs will be written to it (possibly in +addition to syslog). Can be either a relative or absolute path. It +will not be auto-rotated; use something like logrotate for that. + +Default: none + +.IP \fB\-\-log-level\fR +How verbose should the logs be? We log notifications at four levels: +DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of +notifications you would like to receive (in all-caps); more +interesting notifications will be logged as well. The debug output is +extremely verbose and will not be written to syslog even if you try. + +Default: INFO + +.IP \fB\-\-remove\fR,\ \fB\-r\fR +Remove successfully processed files. If you enable this, you can see +at a glance which XML files are not being processed, because they're +all that should be left. + +Default: disabled + +.IP \fB\-\-syslog\fR,\ \fB\-s\fR +Enable logging to syslog. On Windows this will attempt to communicate +(over UDP) with a syslog daemon on localhost, which will most likely +not work. + +Default: disabled + +.SH CONFIGURATION FILE +.P +Any of the command-line options mentioned above can be specified in a +configuration file instead. We first look for \(dqhtsn-importrc\(dq in +the system configuration directory. We then look for a file named +\(dq.htsn-importrc\(dq in the user's home directory. The latter will +override the former. +.P +The user's home directory is simply $HOME on Unix; on Windows it's +wherever %APPDATA% points. The system configuration directory is +determined by Cabal; the \(dqsysconfdir\(dq parameter during the +\(dqconfigure\(dq step is used. +.P +The file's syntax is given by examples in the htsn-importrc.example file +(included with \fBhtsn-import\fR). +.P +Options specified on the command-line override those in either +configuration file. + +.SH EXAMPLES +.IP \[bu] 2 +Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq: + +.nf +.I $ htsn-import --connection-string='foo.sqlite3' \\\\ +.I " test/xml/newsxml.xml" +Successfully imported test/xml/newsxml.xml. +Imported 1 document(s) total. +.fi +.IP \[bu] +Repeat the previous example, but delete newsxml.xml afterwards: + +.nf +.I $ htsn-import --connection-string='foo.sqlite3' \\\\ +.I " --remove test/xml/newsxml.xml" +Successfully imported test/xml/newsxml.xml. +Imported 1 document(s) total. +Removed processed file test/xml/newsxml.xml. +.fi +.IP \[bu] +Use a Postgres database instead of the default Sqlite. This assumes +that you have a database named \(dqhtsn\(dq accessible to user +\(dqpostgres\(dq locally: + +.nf +.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\ +.I " --backend=Postgres test/xml/newsxml.xml" +Successfully imported test/xml/newsxml.xml. +Imported 1 document(s) total. +.fi + +.SH BUGS + +.P +Send bugs to michael@orlitzky.com.