From 181833623b7924580a98089d12d36abd7330e175 Mon Sep 17 00:00:00 2001 From: Michael Orlitzky Date: Thu, 3 Jul 2014 14:29:12 -0400 Subject: [PATCH] Move two READMEs into the man page. Move the list of supported document types to the end of the man page. Add a man page section about new/removed column migrations. --- doc/README.dbschema | 5 - doc/README.schemagen | 49 ---- doc/man1/htsn-import.1 | 509 ++++++++++++++++++++++++++--------------- htsn-import.cabal | 2 - 4 files changed, 324 insertions(+), 241 deletions(-) delete mode 100644 doc/README.dbschema delete mode 100644 doc/README.schemagen diff --git a/doc/README.dbschema b/doc/README.dbschema deleted file mode 100644 index b05cd90..0000000 --- a/doc/README.dbschema +++ /dev/null @@ -1,5 +0,0 @@ -There are database diagrams in the 'doc/dbschema' folder. These are -not authoritative, but if they are incorrect, that should be -considered a bug. - -The diagrams were generated with pgModeler . diff --git a/doc/README.schemagen b/doc/README.schemagen deleted file mode 100644 index d32075b..0000000 --- a/doc/README.schemagen +++ /dev/null @@ -1,49 +0,0 @@ -In order to parse XML, you need to know the structure of your -documents. Usually this is given in the form of a DTD or schema. The -Sports Network does provide DTDs for their XML, but they're wrong! So, -what can we do? - -The easiest option would be to guess and pray. But we need to -construct a database into which to insert the XML. How do we know if - should be a column, or if it should have its own table? We need -to know how many times it can appear. So we need some form of -specification. And reading all of the XML files one at a time to count -the number of s is impractical. So, we would like to generate -the DTDs automatically. - -The process should go something like, - - 1. Generate a DTD from the first foo.xml file we see. Call it - foo.dtd. - - 2. Validate future foo documents against foo.dtd. If they all - validate, great. If one fails, add it to the corpus and update - foo.dtd so that both the original and the new foo.xml validate. - - 3. Repeat until no more failures occur. This can never be perfect: - tomorrow we could get a foo.xml that's wildly different from what - we've seen in the past. But it's the best we can hope for under - the circumstances. - -Enter XML-Schema-learner. This tool can infer a DTD from a set of -sample XML files. The top-level "schemagen" folder (in this project) -contains a number of subfolders -- one for each type of document that -we want to parse. Contained therein are XML samples for that -particular document type. These were hand-picked one at a time -according to the procedure above, and the complete set of XML is what -we use to generate the DTDs used by htsn-import. - -To generate them, run `make schema` at the project -root. XML-Schema-learner will be invoked on each subfolder of -"schemagen" and will output the corresponding DTDs to the "schemagen" -folder. - -Most of the production schemas are generated this way; however, a few -needed manual tweaking. The final, believed-to-be-correct schemas for -all supported document types can be found in the "schema" folder in -the project root. Having the "correct" DTDs available means you -don't need XML-Schema-learner available to install htsn-import. - -As explained in the man page, there is a second type of weatherxml -document that we don't parse at the moment. An example is provided as -schemagen/weatherxml/20143655.xml. diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1 index e912c71..3dcf2be 100644 --- a/doc/man1/htsn-import.1 +++ b/doc/man1/htsn-import.1 @@ -43,8 +43,329 @@ exist. We don't want the schema to change out from under us without warning, so it's important that no XML be parsed that would result in a different schema than we had previously. Since we can pickle/unpickle everything already, this should be impossible. +.P +A list of supported document types is given in the appendix. +.P +The GameInfo and SportInfo types do not have their own top-level +tables in the database. Instead, their raw XML is stored in either the +\(dqgame_info\(dq or \(dqsport_info\(dq table respectively. + +.SH DATABASE SCHEMA +.P +At the top level (with two notable exceptions), we have one table for +each of the XML document types that we import. For example, the +documents corresponding to \fInewsxml.dtd\fR will have a table called +\(dqnews\(dq. All top-level tables contain two important fields, +\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and +prevents us from inserting the same data twice. The time stamp on the +other hand lets us know when the data is old and can be removed. The +database schema make it possible to delete only the outdated top-level +records; all transient children should be removed by triggers. +.P +These top-level tables will often have children. For example, each +news item has zero or more locations associated with it. The child +table will be named _, which in this case +corresponds to \(dqnews_locations\(dq. +.P +To relate the two, a third table may exist with name +__. Note the two underscores. This prevents ambiguity +when the child table itself contains underscores. The table joining +\(dqnews\(dq with \(dqnews_locations\(dq is thus called +\(dqnews__news_locations\(dq. This is necessary when the child table +has a unique constraint; we don't want to blindly insert duplicate +records keyed to the parent. Instead we'd like to use the third table +to map an existing child to the new parent. +.P +Where it makes sense, children are kept unique to prevent pointless +duplication. This slows down inserts, and speeds up reads (which are +much more frequent). There is a tradeoff to be made, however. For a +table with a small, fixed upper bound on the number of rows (like +\(dqodds_casinos\(dq), there is great benefit to de-duplication. The +total number of rows stays small, so inserts are still quick, and many +duplicate rows are eliminated. +.P +But, with a table like \(dqodds_games\(dq, the number of games grows +quickly and without bound. It is therefore more beneficial to be able +to delete the old games (through an ON DELETE CASCADE, tied to +\(dqodds\(dq) than it is to eliminate duplication. A table like +\(dqnews_locations\(dq is somewhere in-between. It is hoped that the +unique constraint in the top-level table's \(dqxml_file_id\(dq will +prevent duplication in this case anyway. +.P +The aforementioned exceptions are the \(dqgame_info\(dq and +\(dqsport_info\(dq tables. These tables contain the raw XML for a +number of DTDs that are not handled individually. This is partially +for backwards-compatibility with a legacy implementation, but is +mostly a stopgap due to a lack of resources at the moment. These two +tables (game_info and sport_info) still possess timestamps that allow +us to prune old data. +.P +UML diagrams of the resulting database schema for each XML document +type are provided with the \fBhtsn-import\fR documentation, in the +\fIdoc/dbschema\fR directory. These are not authoritative, but it +should be considered a bug if they are incorrect. The diagrams are +created using the pgModeler tool. -.SH SUPPORTED DOCUMENT TYPES +.SH XML SCHEMA GENERATION +.P +In order to parse XML, you need to know the structure of your +documents. Usually this is given in the form of a DTD or schema. The +Sports Network does provide DTDs for their XML, but unfortunately many +of them do not match the XML found on the feed. +.P +We need to construct a database into which to insert the XML. How do +we know if should be a column, or if it should have its own +table? We need to know how many times it can appear in the +document. So we need some form of specification. Since the supplied +DTDs are incorrect, we would like to generate them automatically. +.P +The process should go something like, +.IP 1. +Generate a DTD from the first foo.xml file we see. Call it foo.dtd. +.IP 2. +Validate future foo documents against foo.dtd. If they all validate, +great. If one fails, add it to the corpus and update foo.dtd so +that both the original and the new foo.xml validate. +.IP 3. +Repeat until no more failures occur. This can never be perfect: +tomorrow we could get a foo.xml that's wildly different from what +we've seen in the past. But it's the best we can hope for under +the circumstances. +.P +Enter XML-Schema-learner +. This tool can infer a +DTD from a set of sample XML files. The top-level \(dqschemagen\(dq +folder (in this project) contains a number of subfolders\(emone for +each type of document that we want to parse. Contained therein are XML +samples for that particular document type. These were hand-picked one +at a time according to the procedure above, and the complete set of +XML is what we use to generate the DTDs used by htsn-import. +.P +To generate them, run `make schema` at the project +root. XML-Schema-learner will be invoked on each subfolder of +\(dqschemagen\(dq and will output the corresponding DTDs to the +\(dqschemagen\(dq folder. +.P +Most of the production schemas are generated this way; however, a few +needed manual tweaking. The final, believed-to-be-correct schemas for +all supported document types can be found in the \(dqschema\(dq folder in +the project root. Having the correct DTDs available means you +don't need XML-Schema-learner available to install \fBhtsn-import\fR. + +.SH XML SCHEMA UPDATES +.P +If a new tag is added to an XML document type, \fBhtsn-import\fR will +most likely refuse to parse it, since the new documents no longer +match the existing DTD. +.P +The first thing to do in that case is add the unparseable document to +the \(dqschemagen\(dq directory, and generate a new DTD that matches +both the old and new samples. Once a new, correct DTD has been +generated, it should be added to the \(dqschema\(dq directory. Then, +the parser can be updated and \fBhtsn-import\fR rebuilt. +.P +At this point, \fBhtsn-import\fR should be capable of importing the +new document. But the addition of the new tag will most require new +fields in the database. Fortunately, easy migrations like this are +handled automatically. As an example, at one point, \fIOdds_XML.dtd\fR +did not contain the \(dqHStarter\(dq and \(dqAStarter\(dq elements +associated with its games. Suppose we parse one of the old documents +(without \(dqHStarter\(dq and \(dqAStarter\(dq) using an old version +of \fBhtsn-import\fR: +.P +.nf +.I $ htsn-import --connection-string='foo.sqlite3' \\\\ +.I " schemagen/Odds_XML/19996433.xml" +Migration: CREATE TABLE \(dqodds\(dq ... +Successfully imported schemagen/Odds_XML/19996433.xml. +Processed 1 document(s) total. +.fi +.P +At this point, the database schema matches the old documents, i.e. the +ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new +version of \fBhtsn-import\fR, supporting the new fields, the migration +is handled gracefully: +.P +.nf +.I $ htsn-import --connection-string='foo.sqlite3' \\\\ +.I " schemagen/Odds_XML/21315768.xml" +Migration: ALTER TABLE \(dqodds_games\(dq + ADD COLUMN \(dqaway_team_starter_id\(dq INTEGER; +Migration: ALTER TABLE \(dqodds_games\(dq + ADD COLUMN \(dqaway_team_starter_name\(dq VARCHAR; +Migration: ALTER TABLE \(dqodds_games\(dq + ADD COLUMN \(dqhome_team_starter_id\(dq INTEGER; +Migration: ALTER TABLE \(dqodds_games\(dq + ADD COLUMN \(dqhome_team_starter_name\(dq VARCHAR; +Successfully imported schemagen/Odds_XML/21315768.xml. +Processed 1 document(s) total. +.fi +.P +If fields are removed from the schema, then manual intervention may be +necessary: +.P +.nf +.I $ htsn-import -b Postgres -c 'dbname=htsn user=postgres' \\\\ +.I " schemagen/Odds_XML/19996433.xml" +ERROR: Database migration: manual intervention required. +The following actions are considered unsafe: +ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_id\(dq +ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_name\(dq +ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_id\(dq +ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_name\(dq + +ERROR: Failed to import file schemagen/Odds_XML/19996433.xml. +Processed 0 document(s) total. +.fi +.P +To fix these errors, manually invoke the SQL commands that were +considered unsafe: +.P +.nf +.I $ psql -U postgres -d htsn \\\\ +.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_id;'" +ALTER TABLE +.I $ psql -U postgres -d htsn \\\\ +.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_name;'" +ALTER TABLE +.I $ psql -U postgres -d htsn \\\\ +.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_id;'" +ALTER TABLE +.I $ psql -U postgres -d htsn \\\\ +.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_name;'" +ALTER TABLE +.fi +.P +After manually adjusting the schema, the import should succeed. + +.SH XML SCHEMA ODDITIES +.P +There are a number of problems with the XML on the wire. Even if we +construct the DTDs ourselves, the results are sometimes +inconsistent. Here we document a few of them. + +.IP \[bu] 2 +\fIOdds_XML.dtd\fR + +The elements here are supposed to be associated with a set of + elements, but since the pair +(......) can appear zero or more times, +this leads to ambiguity in parsing. We therefore ignore the notes +entirely (although a hack is employed to facilitate parsing). The same +thing goes for the newer element. + +.IP \[bu] +\fIweatherxml.dtd\fR + +There appear to be two types of weather documents; the first has + contained within and the second has +contained within . While it would be possible to parse both, +it would greatly complicate things. The first form is more common, so +that's all we support for now. An example is provided as +schemagen/weatherxml/20143655.xml. + +.SH OPTIONS + +.IP \fB\-\-backend\fR,\ \fB\-b\fR +The RDBMS backend to use. Valid choices are \fISqlite\fR and +\fIPostgres\fR. Capitalization is important, sorry. + +Default: Sqlite + +.IP \fB\-\-connection-string\fR,\ \fB\-c\fR +The connection string used for connecting to the database backend +given by the \fB\-\-backend\fR option. The default is appropriate for +the \fISqlite\fR backend. + +Default: \(dq:memory:\(dq + +.IP \fB\-\-log-file\fR +If you specify a file here, logs will be written to it (possibly in +addition to syslog). Can be either a relative or absolute path. It +will not be auto-rotated; use something like logrotate for that. + +Default: none + +.IP \fB\-\-log-level\fR +How verbose should the logs be? We log notifications at four levels: +DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of +notifications you would like to receive (in all-caps); more +interesting notifications will be logged as well. The debug output is +extremely verbose and will not be written to syslog even if you try. + +Default: INFO + +.IP \fB\-\-remove\fR,\ \fB\-r\fR +Remove successfully processed files. If you enable this, you can see +at a glance which XML files are not being processed, because they're +all that should be left. + +Default: disabled + +.IP \fB\-\-syslog\fR,\ \fB\-s\fR +Enable logging to syslog. On Windows this will attempt to communicate +(over UDP) with a syslog daemon on localhost, which will most likely +not work. + +Default: disabled + +.SH CONFIGURATION FILE +.P +Any of the command-line options mentioned above can be specified in a +configuration file instead. We first look for \(dqhtsn-importrc\(dq in +the system configuration directory. We then look for a file named +\(dq.htsn-importrc\(dq in the user's home directory. The latter will +override the former. +.P +The user's home directory is simply $HOME on Unix; on Windows it's +wherever %APPDATA% points. The system configuration directory is +determined by Cabal; the \(dqsysconfdir\(dq parameter during the +\(dqconfigure\(dq step is used. +.P +The file's syntax is given by examples in the htsn-importrc.example file +(included with \fBhtsn-import\fR). +.P +Options specified on the command-line override those in either +configuration file. + +.SH EXAMPLES +.IP \[bu] 2 +Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq: + +.nf +.I $ htsn-import --connection-string='foo.sqlite3' \\\\ +.I " test/xml/newsxml.xml" +Successfully imported test/xml/newsxml.xml. +Imported 1 document(s) total. +.fi +.IP \[bu] +Repeat the previous example, but delete newsxml.xml afterwards: + +.nf +.I $ htsn-import --connection-string='foo.sqlite3' \\\\ +.I " --remove test/xml/newsxml.xml" +Successfully imported test/xml/newsxml.xml. +Imported 1 document(s) total. +Removed processed file test/xml/newsxml.xml. +.fi +.IP \[bu] +Use a Postgres database instead of the default Sqlite. This assumes +that you have a database named \(dqhtsn\(dq accessible to user +\(dqpostgres\(dq locally: + +.nf +.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\ +.I " --backend=Postgres test/xml/newsxml.xml" +Successfully imported test/xml/newsxml.xml. +Imported 1 document(s) total. +.fi + +.SH BUGS + +.P +Send bugs to michael@orlitzky.com. + +.SH APPENDIX: SUPPORTED DOCUMENT TYPES .P The XML document types obtained from the feed are uniquely identified by their DTDs. We currently support documents with the following DTDs: @@ -71,7 +392,7 @@ weatherxml.dtd .IP \[bu] GameInfo .RS -.IP \[bu] +.IP \[bu] 2 CBASK_Lineup_XML.dtd .IP \[bu] cbaskpreviewxml.dtd @@ -115,7 +436,7 @@ WorldBaseballPreviewXML.dtd .IP \[bu] SportInfo .RS -.IP \[bu] +.IP \[bu] 2 CBASK_3PPctXML.dtd .IP \[bu] Cbask_All_Tourn_Teams_XML.dtd @@ -376,185 +697,3 @@ WNBAStealsXML.dtd .IP \[bu] WNBATurnoversXML.dtd .RE -.P -The GameInfo and SportInfo types do not have their own top-level -tables in the database. Instead, their raw XML is stored in either the -\(dqgame_info\(dq or \(dqsport_info\(dq table respectively. - -.SH DATABASE SCHEMA -.P -At the top level (with two notable exceptions), we have one table for -each of the XML document types that we import. For example, the -documents corresponding to \fInewsxml.dtd\fR will have a table called -\(dqnews\(dq. All top-level tables contain two important fields, -\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and -prevents us from inserting the same data twice. The time stamp on the -other hand lets us know when the data is old and can be removed. The -database schema make it possible to delete only the outdated top-level -records; all transient children should be removed by triggers. -.P -These top-level tables will often have children. For example, each -news item has zero or more locations associated with it. The child -table will be named _, which in this case -corresponds to \(dqnews_locations\(dq. -.P -To relate the two, a third table may exist with name -__. Note the two underscores. This prevents ambiguity -when the child table itself contains underscores. The table joining -\(dqnews\(dq with \(dqnews_locations\(dq is thus called -\(dqnews__news_locations\(dq. This is necessary when the child table -has a unique constraint; we don't want to blindly insert duplicate -records keyed to the parent. Instead we'd like to use the third table -to map an existing child to the new parent. -.P -Where it makes sense, children are kept unique to prevent pointless -duplication. This slows down inserts, and speeds up reads (which are -much more frequent). There is a tradeoff to be made, however. For a -table with a small, fixed upper bound on the number of rows (like -\(dqodds_casinos\(dq), there is great benefit to de-duplication. The -total number of rows stays small, so inserts are still quick, and many -duplicate rows are eliminated. -.P -But, with a table like \(dqodds_games\(dq, the number of games grows -quickly and without bound. It is therefore more beneficial to be able -to delete the old games (through an ON DELETE CASCADE, tied to -\(dqodds\(dq) than it is to eliminate duplication. A table like -\(dqnews_locations\(dq is somewhere in-between. It is hoped that the -unique constraint in the top-level table's \(dqxml_file_id\(dq will -prevent duplication in this case anyway. -.P -The aforementioned exceptions are the \(dqgame_info\(dq and -\(dqsport_info\(dq tables. These tables contain the raw XML for a -number of DTDs that are not handled individually. This is partially -for backwards-compatibility with a legacy implementation, but is -mostly a stopgap due to a lack of resources at the moment. These two -tables (game_info and sport_info) still possess timestamps that allow -us to prune old data. -.P -UML diagrams of the resulting database schema for each XML document -type are provided with the \fBhtsn-import\fR documentation. - -.SH XML Schema Oddities -.P -There are a number of problems with the XML on the wire. Even if we -construct the DTDs ourselves, the results are sometimes -inconsistent. Here we document a few of them. - -.IP \[bu] 2 -Odds_XML.dtd - -The elements here are supposed to be associated with a set of - elements, but since the pair -(......) can appear zero or more times, -this leads to ambiguity in parsing. We therefore ignore the notes -entirely (although a hack is employed to facilitate parsing). - -.IP \[bu] -weatherxml.dtd - -There appear to be two types of weather documents; the first has - contained within and the second has -contained within . While it would be possible to parse both, -it would greatly complicate things. The first form is more common, so -that's all we support for now. - -.SH OPTIONS - -.IP \fB\-\-backend\fR,\ \fB\-b\fR -The RDBMS backend to use. Valid choices are \fISqlite\fR and -\fIPostgres\fR. Capitalization is important, sorry. - -Default: Sqlite - -.IP \fB\-\-connection-string\fR,\ \fB\-c\fR -The connection string used for connecting to the database backend -given by the \fB\-\-backend\fR option. The default is appropriate for -the \fISqlite\fR backend. - -Default: \(dq:memory:\(dq - -.IP \fB\-\-log-file\fR -If you specify a file here, logs will be written to it (possibly in -addition to syslog). Can be either a relative or absolute path. It -will not be auto-rotated; use something like logrotate for that. - -Default: none - -.IP \fB\-\-log-level\fR -How verbose should the logs be? We log notifications at four levels: -DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of -notifications you would like to receive (in all-caps); more -interesting notifications will be logged as well. The debug output is -extremely verbose and will not be written to syslog even if you try. - -Default: INFO - -.IP \fB\-\-remove\fR,\ \fB\-r\fR -Remove successfully processed files. If you enable this, you can see -at a glance which XML files are not being processed, because they're -all that should be left. - -Default: disabled - -.IP \fB\-\-syslog\fR,\ \fB\-s\fR -Enable logging to syslog. On Windows this will attempt to communicate -(over UDP) with a syslog daemon on localhost, which will most likely -not work. - -Default: disabled - -.SH CONFIGURATION FILE -.P -Any of the command-line options mentioned above can be specified in a -configuration file instead. We first look for \(dqhtsn-importrc\(dq in -the system configuration directory. We then look for a file named -\(dq.htsn-importrc\(dq in the user's home directory. The latter will -override the former. -.P -The user's home directory is simply $HOME on Unix; on Windows it's -wherever %APPDATA% points. The system configuration directory is -determined by Cabal; the \(dqsysconfdir\(dq parameter during the -\(dqconfigure\(dq step is used. -.P -The file's syntax is given by examples in the htsn-importrc.example file -(included with \fBhtsn-import\fR). -.P -Options specified on the command-line override those in either -configuration file. - -.SH EXAMPLES -.IP \[bu] 2 -Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq: - -.nf -.I $ htsn-import --connection-string='foo.sqlite3' \\\\ -.I " test/xml/newsxml.xml" -Successfully imported test/xml/newsxml.xml. -Imported 1 document(s) total. -.fi -.IP \[bu] -Repeat the previous example, but delete newsxml.xml afterwards: - -.nf -.I $ htsn-import --connection-string='foo.sqlite3' \\\\ -.I " --remove test/xml/newsxml.xml" -Successfully imported test/xml/newsxml.xml. -Imported 1 document(s) total. -Removed processed file test/xml/newsxml.xml. -.fi -.IP \[bu] -Use a Postgres database instead of the default Sqlite. This assumes -that you have a database named \(dqhtsn\(dq accessible to user -\(dqpostgres\(dq locally: - -.nf -.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\ -.I " --backend=Postgres test/xml/newsxml.xml" -Successfully imported test/xml/newsxml.xml. -Imported 1 document(s) total. -.fi - -.SH BUGS - -.P -Send bugs to michael@orlitzky.com. diff --git a/htsn-import.cabal b/htsn-import.cabal index 7faaf02..81264dc 100644 --- a/htsn-import.cabal +++ b/htsn-import.cabal @@ -11,9 +11,7 @@ extra-source-files: doc/dbschema/*.png doc/htsn-importrc.example doc/man1/htsn-import.1 - doc/README.dbschema doc/README.development - doc/README.schemagen doc/TODO makefile schema/*.dtd -- 2.43.2