warning, so it's important that no XML be parsed that would result in
a different schema than we had previously. Since we can
pickle/unpickle everything already, this should be impossible.
+.P
+A list of supported document types is given in the appendix.
+.P
+The GameInfo and SportInfo types do not have their own top-level
+tables in the database. Instead, their raw XML is stored in either the
+\(dqgame_info\(dq or \(dqsport_info\(dq table respectively.
+
+.SH DATABASE SCHEMA
+.P
+At the top level (with two notable exceptions), we have one table for
+each of the XML document types that we import. For example, the
+documents corresponding to \fInewsxml.dtd\fR will have a table called
+\(dqnews\(dq. All top-level tables contain two important fields,
+\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and
+prevents us from inserting the same data twice. The time stamp, on
+the other hand, lets us know when the data is old and can be removed. The
+database schema makes it possible to delete only the outdated top-level
+records; all transient children should be removed by triggers.
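+.P
+As a rough illustration, a top-level table might be declared as
+follows. This is a hand-written sketch, not the DDL that
+\fBhtsn-import\fR actually generates, and the document-specific
+column is hypothetical:
+.P
+.nf
+CREATE TABLE news (
+  id INTEGER PRIMARY KEY,
+  -- Unique, so the same document cannot be imported twice.
+  xml_file_id INTEGER NOT NULL UNIQUE,
+  -- Lets us decide when the record is old enough to delete.
+  time_stamp TIMESTAMP NOT NULL,
+  -- Hypothetical document-specific field, for illustration only.
+  headline VARCHAR
+);
+.fi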
+.P
+These top-level tables will often have children. For example, each
+news item has zero or more locations associated with it. The child
+table will be named <parent>_<children>, which in this case
+corresponds to \(dqnews_locations\(dq.
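+.P
+Continuing the sketch, the child table could look like the following,
+where the location fields are again hypothetical:
+.P
+.nf
+CREATE TABLE news_locations (
+  id INTEGER PRIMARY KEY,
+  -- Hypothetical location fields, for illustration only.
+  city VARCHAR,
+  state VARCHAR,
+  country VARCHAR
+);
+.fi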
+.P
+To relate the two, a third table may exist with the name
+<parent>__<child>. Note the two underscores; they prevent ambiguity
+when the child table's name itself contains underscores. The table joining
+\(dqnews\(dq with \(dqnews_locations\(dq is thus called
+\(dqnews__news_locations\(dq. This is necessary when the child table
+has a unique constraint; we don't want to blindly insert duplicate
+records keyed to the parent. Instead we'd like to use the third table
+to map an existing child to the new parent.
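+.P
+Under the same caveats as above, the join table would consist of
+little more than the two foreign keys:
+.P
+.nf
+CREATE TABLE news__news_locations (
+  id INTEGER PRIMARY KEY,
+  news_id INTEGER NOT NULL
+    REFERENCES news(id) ON DELETE CASCADE,
+  news_locations_id INTEGER NOT NULL
+    REFERENCES news_locations(id)
+);
+.fi
+.P
+A location that already exists then costs only one new row here,
+rather than a duplicate row in \(dqnews_locations\(dq.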
+.P
+Where it makes sense, children are kept unique to prevent pointless
+duplication. This slows down inserts and speeds up reads (which are
+much more frequent). There is a tradeoff to be made, however. For a
+table with a small, fixed upper bound on the number of rows (like
+\(dqodds_casinos\(dq), there is great benefit to de-duplication. The
+total number of rows stays small, so inserts are still quick, and many
+duplicate rows are eliminated.
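+.P
+The de-duplication itself amounts to \(dqfind, else insert\(dq: look
+the child up by its unique columns, and insert it only when the
+lookup comes back empty. A sketch of the idea, with hypothetical
+column names:
+.P
+.nf
+-- Reuse the existing casino if there is one...
+SELECT id FROM odds_casinos WHERE abbreviation = 'FOO';
+-- ...and insert it only if that query returned no rows.
+INSERT INTO odds_casinos (abbreviation, name)
+  VALUES ('FOO', 'Foo Casino');
+.fi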
+.P
+But with a table like \(dqodds_games\(dq, the number of games grows
+quickly and without bound. It is therefore more beneficial to be able
+to delete the old games (through an ON DELETE CASCADE, tied to
+\(dqodds\(dq) than it is to eliminate duplication. A table like
+\(dqnews_locations\(dq is somewhere in between. It is hoped that the
+unique constraint on the top-level table's \(dqxml_file_id\(dq column
+will prevent duplication in this case anyway.
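+.P
+The cascade requires only that the child's foreign key be declared
+appropriately; deleting an outdated \(dqodds\(dq row then takes its
+games with it. Schematically (the retention window and the SQLite
+date arithmetic are illustrative):
+.P
+.nf
+CREATE TABLE odds_games (
+  id INTEGER PRIMARY KEY,
+  odds_id INTEGER NOT NULL
+    REFERENCES odds(id) ON DELETE CASCADE
+);
+
+-- Dropping stale parents now drops their games, too.
+DELETE FROM odds WHERE time_stamp < datetime('now', '-7 days');
+.fi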
+.P
+The aforementioned exceptions are the \(dqgame_info\(dq and
+\(dqsport_info\(dq tables. These tables contain the raw XML for a
+number of DTDs that are not handled individually. This is partially
+for backwards-compatibility with a legacy implementation, but is
+mostly a stopgap due to a lack of resources at the moment. These two
+tables still have time stamps that allow us to prune old data.
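+.P
+Since they store the raw XML, these two tables are structurally
+trivial. A sketch, assuming (hypothetically) that the XML lands in a
+single text column:
+.P
+.nf
+CREATE TABLE game_info (
+  id INTEGER PRIMARY KEY,
+  xml_file_id INTEGER NOT NULL UNIQUE,
+  time_stamp TIMESTAMP NOT NULL,
+  xml TEXT NOT NULL
+);
+.fi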
+.P
+UML diagrams of the resulting database schema for each XML document
+type are provided with the \fBhtsn-import\fR documentation, in the
+\fIdoc/dbschema\fR directory. These are not authoritative, but it
+should be considered a bug if they are incorrect. The diagrams are
+created using the pgModeler <http://www.pgmodeler.com.br/> tool.
-.SH SUPPORTED DOCUMENT TYPES
+.SH XML SCHEMA GENERATION
+.P
+In order to parse XML, you need to know the structure of your
+documents. Usually this is given in the form of a DTD or schema. The
+Sports Network does provide DTDs for their XML, but unfortunately many
+of them do not match the XML found on the feed.
+.P
+We need to construct a database into which to insert the XML. How do
+we know whether <game> should be a column, or should have its own
+table? That depends on how many times the element can appear in the
+document, so we need some form of specification. Since the supplied
+DTDs are incorrect, we would like to generate correct ones
+automatically.
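+.P
+For example, given the hypothetical DTD fragment in the sketch below,
+<title> can occur at most once and may therefore be flattened into a
+column, while the repeatable <bar> needs a table of its own:
+.P
+.nf
+-- Hypothetical DTD fragment: <!ELEMENT foo (title, bar*)>
+
+-- <title> occurs exactly once per <foo>, so it becomes a column:
+CREATE TABLE foo (
+  id INTEGER PRIMARY KEY,
+  title VARCHAR NOT NULL
+);
+
+-- <bar> may repeat, so it becomes a child table keyed to its parent:
+CREATE TABLE foo_bars (
+  id INTEGER PRIMARY KEY,
+  foo_id INTEGER NOT NULL REFERENCES foo(id)
+);
+.fi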
+.P
+The process goes something like this:
+.IP 1.
+Generate a DTD from the first foo.xml file we see. Call it foo.dtd.
+.IP 2.
+Validate future foo documents against foo.dtd. If they all validate,
+great. If one fails, add it to the corpus and update foo.dtd so
+that both the original and the new foo.xml validate.
+.IP 3.
+Repeat until no more failures occur. This can never be perfect:
+tomorrow we could get a foo.xml that's wildly different from what
+we've seen in the past. But it's the best we can hope for under
+the circumstances.
+.P
+Enter XML-Schema-learner
+<https://github.com/kore/XML-Schema-learner>. This tool can infer a
+DTD from a set of sample XML files. The top-level \(dqschemagen\(dq
+folder (in this project) contains a number of subfolders\(emone for
+each type of document that we want to parse. Contained therein are XML
+samples for that particular document type. These were hand-picked one
+at a time according to the procedure above, and the complete set of
+XML is what we use to generate the DTDs used by htsn-import.
+.P
+To generate them, run \fBmake schema\fR at the project
+root. XML-Schema-learner will be invoked on each subfolder of
+\(dqschemagen\(dq and will output the corresponding DTDs to the
+\(dqschemagen\(dq folder.
+.P
+Most of the production schemas are generated this way; however, a few
+needed manual tweaking. The final, believed-to-be-correct schemas for
+all supported document types can be found in the \(dqschema\(dq folder in
+the project root. Having the correct DTDs on hand means you don't
+need XML-Schema-learner in order to install \fBhtsn-import\fR.
+
+.SH XML SCHEMA UPDATES
+.P
+If a new tag is added to an XML document type, \fBhtsn-import\fR will
+most likely refuse to parse it, since the new documents no longer
+match the existing DTD.
+.P
+The first thing to do in that case is add the unparseable document to
+the \(dqschemagen\(dq directory, and generate a new DTD that matches
+both the old and new samples. Once a new, correct DTD has been
+generated, it should be added to the \(dqschema\(dq directory. Then,
+the parser can be updated and \fBhtsn-import\fR rebuilt.
+.P
+At this point, \fBhtsn-import\fR should be capable of importing the
+new document. But the addition of the new tag will most likely require new
+fields in the database. Fortunately, easy migrations like this are
+handled automatically. As an example, at one point, \fIOdds_XML.dtd\fR
+did not contain the \(dqHStarter\(dq and \(dqAStarter\(dq elements
+associated with its games. Suppose we parse one of the old documents
+(without \(dqHStarter\(dq and \(dqAStarter\(dq) using an old version
+of \fBhtsn-import\fR:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+Migration: CREATE TABLE \(dqodds\(dq ...
+Successfully imported schemagen/Odds_XML/19996433.xml.
+Processed 1 document(s) total.
+.fi
+.P
+At this point, the database schema matches the old documents, i.e. the
+ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+version of \fBhtsn-import\fR, supporting the new fields, the migration
+is handled gracefully:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/21315768.xml"
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_name\(dq VARCHAR;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_name\(dq VARCHAR;
+Successfully imported schemagen/Odds_XML/21315768.xml.
+Processed 1 document(s) total.
+.fi
+.P
+If fields are removed from the schema, then manual intervention may be
+necessary:
+.P
+.nf
+.I $ htsn-import -b Postgres -c 'dbname=htsn user=postgres' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+ERROR: Database migration: manual intervention required.
+The following actions are considered unsafe:
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_name\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_name\(dq
+
+ERROR: Failed to import file schemagen/Odds_XML/19996433.xml.
+Processed 0 document(s) total.
+.fi
+.P
+To fix these errors, manually invoke the SQL commands that were
+considered unsafe:
+.P
+.nf
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_name;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_name;'"
+ALTER TABLE
+.fi
+.P
+After manually adjusting the schema, the import should succeed.
+
+.SH XML SCHEMA ODDITIES
+.P
+There are a number of problems with the XML on the wire. Even if we
+construct the DTDs ourselves, the results are sometimes
+inconsistent. Here we document a few of them.
+
+.IP \[bu] 2
+\fIOdds_XML.dtd\fR
+
+The <Notes> elements here are supposed to be associated with a set of
+<Game> elements, but since the pair
+(<Notes>...</Notes><Game>...</Game>) can appear zero or more times,
+this leads to ambiguity in parsing. We therefore ignore the notes
+entirely (although a hack is employed to facilitate parsing). The same
+thing goes for the newer <League_Name> element.
+
+.IP \[bu]
+\fIweatherxml.dtd\fR
+
+There appear to be two types of weather documents; the first has
+<listing> contained within <forecast> and the second has <forecast>
+contained within <listing>. While it would be possible to parse both,
+it would greatly complicate things. The first form is more common, so
+that's all we support for now. An example is provided as
+schemagen/weatherxml/20143655.xml.
+
+.SH OPTIONS
+
+.IP \fB\-\-backend\fR,\ \fB\-b\fR
+The RDBMS backend to use. Valid choices are \fISqlite\fR and
+\fIPostgres\fR. Capitalization is important, sorry.
+
+Default: Sqlite
+
+.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
+The connection string used for connecting to the database backend
+given by the \fB\-\-backend\fR option. The default is appropriate for
+the \fISqlite\fR backend.
+
+Default: \(dq:memory:\(dq
+
+.IP \fB\-\-log-file\fR
+If you specify a file here, logs will be written to it (possibly in
+addition to syslog). Can be either a relative or absolute path. It
+will not be auto-rotated; use something like logrotate for that.
+
+Default: none
+
+.IP \fB\-\-log-level\fR
+How verbose should the logs be? We log notifications at four levels:
+DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
+notifications you would like to receive (in all-caps); more
+interesting notifications will be logged as well. The debug output is
+extremely verbose and will not be written to syslog even if you try.
+
+Default: INFO
+
+.IP \fB\-\-remove\fR,\ \fB\-r\fR
+Remove successfully processed files. If you enable this, you can see
+at a glance which XML files are not being processed, because they're
+all that should be left.
+
+Default: disabled
+
+.IP \fB\-\-syslog\fR,\ \fB\-s\fR
+Enable logging to syslog. On Windows this will attempt to communicate
+(over UDP) with a syslog daemon on localhost, which will most likely
+not work.
+
+Default: disabled
+
+.SH CONFIGURATION FILE
+.P
+Any of the command-line options mentioned above can be specified in a
+configuration file instead. We first look for \(dqhtsn-importrc\(dq in
+the system configuration directory. We then look for a file named
+\(dq.htsn-importrc\(dq in the user's home directory. The latter will
+override the former.
+.P
+The user's home directory is simply $HOME on Unix; on Windows it's
+wherever %APPDATA% points. The system configuration directory is
+determined by Cabal; the \(dqsysconfdir\(dq parameter during the
+\(dqconfigure\(dq step is used.
+.P
+The file's syntax is given by examples in the htsn-importrc.example file
+(included with \fBhtsn-import\fR).
+.P
+Options specified on the command-line override those in either
+configuration file.
+
+.SH EXAMPLES
+.IP \[bu] 2
+Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq:
+
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " test/xml/newsxml.xml"
+Successfully imported test/xml/newsxml.xml.
+Imported 1 document(s) total.
+.fi
+.IP \[bu]
+Repeat the previous example, but delete newsxml.xml afterwards:
+
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " --remove test/xml/newsxml.xml"
+Successfully imported test/xml/newsxml.xml.
+Imported 1 document(s) total.
+Removed processed file test/xml/newsxml.xml.
+.fi
+.IP \[bu]
+Use a Postgres database instead of the default Sqlite. This assumes
+that you have a database named \(dqhtsn\(dq accessible to user
+\(dqpostgres\(dq locally:
+
+.nf
+.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\
+.I " --backend=Postgres test/xml/newsxml.xml"
+Successfully imported test/xml/newsxml.xml.
+Imported 1 document(s) total.
+.fi
+
+.SH BUGS
+
+.P
+Send bugs to michael@orlitzky.com.
+
+.SH APPENDIX: SUPPORTED DOCUMENT TYPES
.P
The XML document types obtained from the feed are uniquely identified
by their DTDs. We currently support documents with the following DTDs:
.IP \[bu]
GameInfo
.RS
-.IP \[bu]
+.IP \[bu] 2
CBASK_Lineup_XML.dtd
.IP \[bu]
cbaskpreviewxml.dtd
.IP \[bu]
SportInfo
.RS
-.IP \[bu]
+.IP \[bu] 2
CBASK_3PPctXML.dtd
.IP \[bu]
Cbask_All_Tourn_Teams_XML.dtd
.IP \[bu]
WNBATurnoversXML.dtd
.RE
-.P
-The GameInfo and SportInfo types do not have their own top-level
-tables in the database. Instead, their raw XML is stored in either the
-\(dqgame_info\(dq or \(dqsport_info\(dq table respectively.
-
-.SH DATABASE SCHEMA
-.P
-At the top level (with two notable exceptions), we have one table for
-each of the XML document types that we import. For example, the
-documents corresponding to \fInewsxml.dtd\fR will have a table called
-\(dqnews\(dq. All top-level tables contain two important fields,
-\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and
-prevents us from inserting the same data twice. The time stamp on the
-other hand lets us know when the data is old and can be removed. The
-database schema make it possible to delete only the outdated top-level
-records; all transient children should be removed by triggers.
-.P
-These top-level tables will often have children. For example, each
-news item has zero or more locations associated with it. The child
-table will be named <parent>_<children>, which in this case
-corresponds to \(dqnews_locations\(dq.
-.P
-To relate the two, a third table may exist with name
-<parent>__<child>. Note the two underscores. This prevents ambiguity
-when the child table itself contains underscores. The table joining
-\(dqnews\(dq with \(dqnews_locations\(dq is thus called
-\(dqnews__news_locations\(dq. This is necessary when the child table
-has a unique constraint; we don't want to blindly insert duplicate
-records keyed to the parent. Instead we'd like to use the third table
-to map an existing child to the new parent.
-.P
-Where it makes sense, children are kept unique to prevent pointless
-duplication. This slows down inserts, and speeds up reads (which are
-much more frequent). There is a tradeoff to be made, however. For a
-table with a small, fixed upper bound on the number of rows (like
-\(dqodds_casinos\(dq), there is great benefit to de-duplication. The
-total number of rows stays small, so inserts are still quick, and many
-duplicate rows are eliminated.
-.P
-But, with a table like \(dqodds_games\(dq, the number of games grows
-quickly and without bound. It is therefore more beneficial to be able
-to delete the old games (through an ON DELETE CASCADE, tied to
-\(dqodds\(dq) than it is to eliminate duplication. A table like
-\(dqnews_locations\(dq is somewhere in-between. It is hoped that the
-unique constraint in the top-level table's \(dqxml_file_id\(dq will
-prevent duplication in this case anyway.
-.P
-The aforementioned exceptions are the \(dqgame_info\(dq and
-\(dqsport_info\(dq tables. These tables contain the raw XML for a
-number of DTDs that are not handled individually. This is partially
-for backwards-compatibility with a legacy implementation, but is
-mostly a stopgap due to a lack of resources at the moment. These two
-tables (game_info and sport_info) still possess timestamps that allow
-us to prune old data.
-.P
-UML diagrams of the resulting database schema for each XML document
-type are provided with the \fBhtsn-import\fR documentation.
-
-.SH XML Schema Oddities
-.P
-There are a number of problems with the XML on the wire. Even if we
-construct the DTDs ourselves, the results are sometimes
-inconsistent. Here we document a few of them.
-
-.IP \[bu] 2
-Odds_XML.dtd
-
-The <Notes> elements here are supposed to be associated with a set of
-<Game> elements, but since the pair
-(<Notes>...</Notes><Game>...</Game>) can appear zero or more times,
-this leads to ambiguity in parsing. We therefore ignore the notes
-entirely (although a hack is employed to facilitate parsing).
-
-.IP \[bu]
-weatherxml.dtd
-
-There appear to be two types of weather documents; the first has
-<listing> contained within <forecast> and the second has <forecast>
-contained within <listing>. While it would be possible to parse both,
-it would greatly complicate things. The first form is more common, so
-that's all we support for now.
-
-.SH OPTIONS
-
-.IP \fB\-\-backend\fR,\ \fB\-b\fR
-The RDBMS backend to use. Valid choices are \fISqlite\fR and
-\fIPostgres\fR. Capitalization is important, sorry.
-
-Default: Sqlite
-
-.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
-The connection string used for connecting to the database backend
-given by the \fB\-\-backend\fR option. The default is appropriate for
-the \fISqlite\fR backend.
-
-Default: \(dq:memory:\(dq
-
-.IP \fB\-\-log-file\fR
-If you specify a file here, logs will be written to it (possibly in
-addition to syslog). Can be either a relative or absolute path. It
-will not be auto-rotated; use something like logrotate for that.
-
-Default: none
-
-.IP \fB\-\-log-level\fR
-How verbose should the logs be? We log notifications at four levels:
-DEBUG, INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
-notifications you would like to receive (in all-caps); more
-interesting notifications will be logged as well. The debug output is
-extremely verbose and will not be written to syslog even if you try.
-
-Default: INFO
-
-.IP \fB\-\-remove\fR,\ \fB\-r\fR
-Remove successfully processed files. If you enable this, you can see
-at a glance which XML files are not being processed, because they're
-all that should be left.
-
-Default: disabled
-
-.IP \fB\-\-syslog\fR,\ \fB\-s\fR
-Enable logging to syslog. On Windows this will attempt to communicate
-(over UDP) with a syslog daemon on localhost, which will most likely
-not work.
-
-Default: disabled
-
-.SH CONFIGURATION FILE
-.P
-Any of the command-line options mentioned above can be specified in a
-configuration file instead. We first look for \(dqhtsn-importrc\(dq in
-the system configuration directory. We then look for a file named
-\(dq.htsn-importrc\(dq in the user's home directory. The latter will
-override the former.
-.P
-The user's home directory is simply $HOME on Unix; on Windows it's
-wherever %APPDATA% points. The system configuration directory is
-determined by Cabal; the \(dqsysconfdir\(dq parameter during the
-\(dqconfigure\(dq step is used.
-.P
-The file's syntax is given by examples in the htsn-importrc.example file
-(included with \fBhtsn-import\fR).
-.P
-Options specified on the command-line override those in either
-configuration file.
-
-.SH EXAMPLES
-.IP \[bu] 2
-Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq:
-
-.nf
-.I $ htsn-import --connection-string='foo.sqlite3' \\\\
-.I " test/xml/newsxml.xml"
-Successfully imported test/xml/newsxml.xml.
-Imported 1 document(s) total.
-.fi
-.IP \[bu]
-Repeat the previous example, but delete newsxml.xml afterwards:
-
-.nf
-.I $ htsn-import --connection-string='foo.sqlite3' \\\\
-.I " --remove test/xml/newsxml.xml"
-Successfully imported test/xml/newsxml.xml.
-Imported 1 document(s) total.
-Removed processed file test/xml/newsxml.xml.
-.fi
-.IP \[bu]
-Use a Postgres database instead of the default Sqlite. This assumes
-that you have a database named \(dqhtsn\(dq accessible to user
-\(dqpostgres\(dq locally:
-
-.nf
-.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\
-.I " --backend=Postgres test/xml/newsxml.xml"
-Successfully imported test/xml/newsxml.xml.
-Imported 1 document(s) total.
-.fi
-
-.SH BUGS
-
-.P
-Send bugs to michael@orlitzky.com.