X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=963f4504d1111e63e5da07c1c18db0259f9d09f4;hb=00fd3d5b3b41cbb9691bbf6612cb72b099bd2b77;hp=66506c6febaf300dd8be82ce7e5f9c1b624627fb;hpb=65b7b8eccef710e67fa6050dd4bdaffbbea708a7;p=dead%2Fhtsn-import.git diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1 index 66506c6..963f450 100644 --- a/doc/man1/htsn-import.1 +++ b/doc/man1/htsn-import.1 @@ -16,9 +16,10 @@ XML documents contained therein. But what to do with them? .P The purpose of \fBhtsn-import\fR is to take these XML documents and get them into something we can use, a relational database management -system (RDBMS), loosely known as a SQL database. The structure of +system (RDBMS), otherwise known as a SQL database. The structure of relational database, is, well, relational, and the feed XML is not. So -there is some work to do before the data can be inserted. +there is some work to do before the data can be imported into the +database. .P First, we must parse the XML. Each supported document type (see below) has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a @@ -47,49 +48,304 @@ pickle/unpickle everything already, this should be impossible. The XML document types obtained from the feed are uniquely identified by their DTDs. We currently support documents with the following DTDs: .IP \[bu] 2 -Heartbeat.dtd +AutoRacingResultsXML.dtd .IP \[bu] -newsxml.dtd +Auto_Racing_Schedule_XML.dtd +.IP \[bu] +Heartbeat.dtd .IP \[bu] Injuries_Detail_XML.dtd .IP \[bu] injuriesxml.dtd .IP \[bu] +newsxml.dtd +.IP \[bu] Odds_XML.dtd .IP \[bu] +scoresxml.dtd +.IP \[bu] weatherxml.dtd +.IP \[bu] +GameInfo +.RS +.IP \[bu] +CBASK_Lineup_XML.dtd +.IP \[bu] +cbaskpreviewxml.dtd +.IP \[bu] +cflpreviewxml.dtd +.IP \[bu] +Matchup_NBA_NHL_XML.dtd +.IP \[bu] +MLB_Gaming_Matchup_XML.dtd +.IP \[bu] +MLB_Lineup_XML.dtd +.IP \[bu] +MLB_Matchup_XML.dtd +.IP \[bu] +MLS_Preview_XML.dtd +.IP \[bu] +mlbpreviewxml.dtd +.IP \[bu] +NBA_Gaming_Matchup_XML.dtd +.IP \[bu] +NBA_Playoff_Matchup_XML.dtd +.IP \[bu] +NBALineupXML.dtd +.IP \[bu] +nbapreviewxml.dtd +.IP \[bu] +NCAA_FB_Preview_XML.dtd +.IP \[bu] +NFL_NCAA_FB_Matchup_XML.dtd +.IP \[bu] +nflpreviewxml.dtd +.IP \[bu] +nhlpreviewxml.dtd +.IP \[bu] +recapxml.dtd +.IP \[bu] +WorldBaseballPreviewXML.dtd +.RE +.IP \[bu] +SportInfo +.RS +.IP \[bu] +CBASK_3PPctXML.dtd +.IP \[bu] +Cbask_All_Tourn_Teams_XML.dtd +.IP \[bu] +CBASK_AssistsXML.dtd +.IP \[bu] +Cbask_Awards_XML.dtd +.IP \[bu] +CBASK_BlocksXML.dtd +.IP \[bu] +Cbask_Conf_Standings_XML.dtd +.IP \[bu] +Cbask_DivII_III_Indv_Stats_XML.dtd +.IP \[bu] +Cbask_DivII_Team_Stats_XML.dtd +.IP \[bu] +Cbask_DivIII_Team_Stats_XML.dtd +.IP \[bu] +CBASK_FGPctXML.dtd +.IP \[bu] +CBASK_FoulsXML.dtd +.IP \[bu] +CBASK_FTPctXML.dtd +.IP \[bu] +Cbask_Indv_Scoring_XML.dtd +.IP \[bu] +CBASK_MinutesXML.dtd +.IP \[bu] +Cbask_Polls_XML.dtd +.IP \[bu] +CBASK_ReboundsXML.dtd +.IP \[bu] +CBASK_ScoringLeadersXML.dtd +.IP \[bu] +Cbask_Team_ThreePT_Made_XML.dtd +.IP \[bu] +Cbask_Team_ThreePT_PCT_XML.dtd +.IP \[bu] +Cbask_Team_Win_Pct_XML.dtd +.IP \[bu] +Cbask_Top_Twenty_Five_XML.dtd +.IP \[bu] +CBASK_TopTwentyFiveResult_XML.dtd +.IP \[bu] +Cbask_Tourn_Awards_XML.dtd +.IP \[bu] +Cbask_Tourn_Champs_XML.dtd +.IP \[bu] +Cbask_Tourn_Indiv_XML.dtd +.IP \[bu] +Cbask_Tourn_Leaders_XML.dtd +.IP \[bu] +Cbask_Tourn_MVP_XML.dtd +.IP \[bu] +Cbask_Tourn_Records_XML.dtd +.IP \[bu] +LeagueScheduleXML.dtd +.IP \[bu] +minorscoresxml.dtd +.IP \[bu] +Minor_Baseball_League_Leaders_XML.dtd +.IP \[bu] +Minor_Baseball_Standings_XML.dtd +.IP \[bu] +Minor_Baseball_Transactions_XML.dtd +.IP \[bu] +mlbbattingavgxml.dtd +.IP \[bu] +mlbdoublesleadersxml.dtd +.IP \[bu] +MLBGamesPlayedXML.dtd +.IP \[bu] +MLBGIDPXML.dtd +.IP \[bu] +MLBHitByPitchXML.dtd +.IP \[bu] +mlbhitsleadersxml.dtd +.IP \[bu] +mlbhomerunsxml.dtd +.IP \[bu] +MLBHRFreqXML.dtd +.IP \[bu] +MLBIntWalksXML.dtd +.IP \[bu] +MLBKORateXML.dtd +.IP \[bu] +mlbonbasepctxml.dtd +.IP \[bu] +MLBOPSXML.dtd +.IP \[bu] +MLBPlateAppsXML.dtd +.IP \[bu] +mlbrbisxml.dtd +.IP \[bu] +mlbrunsleadersxml.dtd +.IP \[bu] +MLBSacFliesXML.dtd +.IP \[bu] +MLBSacrificesXML.dtd +.IP \[bu] +MLBSBSuccessXML.dtd +.IP \[bu] +mlbsluggingpctxml.dtd +.IP \[bu] +mlbstandxml.dtd +.IP \[bu] +mlbstandxml_preseason.dtd +.IP \[bu] +mlbstolenbasexml.dtd +.IP \[bu] +mlbtotalbasesleadersxml.dtd +.IP \[bu] +mlbtriplesleadersxml.dtd +.IP \[bu] +MLBWalkRateXML.dtd +.IP \[bu] +mlbwalksleadersxml.dtd +.IP \[bu] +MLBXtraBaseHitsXML.dtd +.IP \[bu] +MLB_Pitching_Appearances_Leaders.dtd +.IP \[bu] +MLB_ERA_Leaders.dtd +.IP \[bu] +MLB_Pitching_Balks_Leaders.dtd +.IP \[bu] +MLB_Pitching_CG_Leaders.dtd +.IP \[bu] +MLB_Pitching_ER_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_Hits_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_Hit_Batters_Leaders.dtd +.IP \[bu] +MLB_Pitching_HR_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_IP_Leaders.dtd +.IP \[bu] +MLB_Pitching_Runs_Allowed_Leaders.dtd +.IP \[bu] +MLB_Pitching_Saves_Leaders.dtd +.IP \[bu] +MLB_Pitching_Shut_Outs_Leaders.dtd +.IP \[bu] +MLB_Pitching_Starts_Leaders.dtd +.IP \[bu] +MLB_Pitching_Strike_Outs_Leaders.dtd +.IP \[bu] +MLB_Pitching_Walks_Leaders.dtd +.IP \[bu] +MLB_Pitching_WHIP_Leaders.dtd +.IP \[bu] +MLB_Pitching_Wild_Pitches_Leaders.dtd +.RE +.P +The GameInfo and SportInfo types do not have their own top-level +tables in the database. Instead, their raw XML is stored in either the +\(dqgame_info\(dq or \(dqsport_info\(dq table respectively. .SH DATABASE SCHEMA .P -At the top level, we have one table for each of the XML document types -that we import. For example, the documents corresponding to -\fInewsxml.dtd\fR will have a table called \(dqnews\(dq. +At the top level (with two notable exceptions), we have one table for +each of the XML document types that we import. For example, the +documents corresponding to \fInewsxml.dtd\fR will have a table called +\(dqnews\(dq. All top-level tables contain two important fields, +\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and +prevents us from inserting the same data twice. The time stamp on the +other hand lets us know when the data is old and can be removed. The +database schema make it possible to delete only the outdated top-level +records; all transient children should be removed by triggers. .P These top-level tables will often have children. For example, each news item has zero or more locations associated with it. The child table will be named _, which in this case -corresponsds to \(dqnews_locations\(dq. +corresponds to \(dqnews_locations\(dq. +.P +To relate the two, a third table may exist with name +__. Note the two underscores. This prevents ambiguity +when the child table itself contains underscores. The table joining +\(dqnews\(dq with \(dqnews_locations\(dq is thus called +\(dqnews__news_locations\(dq. This is necessary when the child table +has a unique constraint; we don't want to blindly insert duplicate +records keyed to the parent. Instead we'd like to use the third table +to map an existing child to the new parent. .P -To relate the two, a third table exists with name __. Note the two underscores. This prevents -ambiguity when the child table itself contains underscores. As long we -never go more than one level down, this system should suffice. The -table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called -\(dqnews__news_locations\(dq. +Where it makes sense, children are kept unique to prevent pointless +duplication. This slows down inserts, and speeds up reads (which are +much more frequent). There is a tradeoff to be made, however. For a +table with a small, fixed upper bound on the number of rows (like +\(dqodds_casinos\(dq), there is great benefit to de-duplication. The +total number of rows stays small, so inserts are still quick, and many +duplicate rows are eliminated. .P -Wherever possible, children are kept unique to prevent pointless -duplication. This slows down inserts, and speeds up reads (which we -assume are much more frequent). The current rate at which the feed -transmits XML is much too slow to cause problems inserting. +But, with a table like \(dqodds_games\(dq, the number of games grows +quickly and without bound. It is therefore more beneficial to be able +to delete the old games (through an ON DELETE CASCADE, tied to +\(dqodds\(dq) than it is to eliminate duplication. A table like +\(dqnews_locations\(dq is somewhere in-between. It is hoped that the +unique constraint in the top-level table's \(dqxml_file_id\(dq will +prevent duplication in this case anyway. +.P +The aforementioned exceptions are the \(dqgame_info\(dq and +\(dqsport_info\(dq tables. These tables contain the raw XML for a +number of DTDs that are not handled individually. This is partially +for backwards-compatibility with a legacy implementation, but is +mostly a stopgap due to a lack of resources at the moment. These two +tables (game_info and sport_info) still possess timestamps that allow +us to prune old data. .P UML diagrams of the resulting database schema for each XML document type are provided with the \fBhtsn-import\fR documentation. + +.SH XML Schema Oddities .P -In some cases the top-level table for a document type has been -omitted. For example, all of the information in the the -\(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq -elements. We therefore omit the \(dqinjuries\(dq table and create only -\(dqinjuries_listings\(dq. +There are a number of problems with the XML on the wire. Even if we +construct the DTDs ourselves, the results are sometimes +inconsistent. Here we document a few of them. + +.IP \[bu] 2 +Odds_XML.dtd + +The elements here are supposed to be associated with a set of + elements, but since the pair +(......) can appear zero or more times, +this leads to ambiguity in parsing. We therefore ignore the notes +entirely (although a hack is employed to facilitate parsing). + +.IP \[bu] +weatherxml.dtd + +There appear to be two types of weather documents; the first has + contained within and the second has +contained within . While it would be possible to parse both, +it would greatly complicate things. The first form is more common, so +that's all we support for now. .SH OPTIONS @@ -101,7 +357,7 @@ Default: Sqlite .IP \fB\-\-connection-string\fR,\ \fB\-c\fR The connection string used for connecting to the database backend -given by the \fB\-\-baclend\fR option. The default is appropriate for +given by the \fB\-\-backend\fR option. The default is appropriate for the \fISqlite\fR backend. Default: \(dq:memory:\(dq @@ -114,10 +370,11 @@ will not be auto-rotated; use something like logrotate for that. Default: none .IP \fB\-\-log-level\fR -How verbose should the logs be? We log notifications at three levels: -INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of +How verbose should the logs be? We log notifications at four levels: +DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of notifications you would like to receive (in all-caps); more -interesting notifications will be logged as well. +interesting notifications will be logged as well. The debug output is +extremely verbose and will not be written to syslog even if you try. Default: INFO @@ -135,12 +392,6 @@ not work. Default: disabled -.IP \fB\-\-username\fR,\ \fB\-u\fR -Your TSN username. A username is required, so you must supply one -either on the command line or in a configuration file. - -Default: none - .SH CONFIGURATION FILE .P Any of the command-line options mentioned above can be specified in a