X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=426d18945d7207c79f8cfdcdc029e402c78cb26e;hb=16d86e7a3c1eda08b91752f92510a1de0b952a17;hp=0816be35d33cc5eb8a035dc551e40d7e05f1b44b;hpb=ec6c5b56f8e3096786e8f0a0d3c5c3c1610b69f7;p=dead%2Fhtsn-import.git
diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 0816be3..66f7ae5 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -8,26 +8,901 @@ htsn-import \- Import XML files from The Sports Network into an RDBMS.
\fBhtsn-import\fR [OPTIONS] [FILES]
.SH DESCRIPTION
+.P
+The Sports Network offers an XML feed
+containing various sports news and statistics. Our sister program
+\fBhtsn\fR is capable of retrieving the feed and saving the individual
+XML documents contained therein. But what to do with them?
+.P
+The purpose of \fBhtsn-import\fR is to take these XML documents and
+get them into something we can use, a relational database management
+system (RDBMS), otherwise known as a SQL database. The structure of
+relational database, is, well, relational, and the feed XML is not. So
+there is some work to do before the data can be imported into the
+database.
+.P
+First, we must parse the XML. Each supported document type (see below)
+has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
+synonym for \(dqserialize\(dq here). That means that we parse the
+entire document into a data structure, and if we pickle (serialize)
+that data structure, we get the exact same XML document tha we started
+with.
+.P
+This is important for two reasons. First, it serves as a second level
+of validation. The first validation is performed by the XML parser,
+but if that succeeds and unpicking fails, we know that something is
+fishy. Second, we don't ever want to be surprised by some new element
+or attribute showing up in the XML. The fact that we can unpickle the
+whole thing now means that we won't be surprised in the future.
+.P
+The aforementioned feature is especially important because we
+automatically migrate the database schema every time we import a
+document. If you attempt to import a \(dqnewsxml.dtd\(dq document, all
+database objects relating to the news will be created if they do not
+exist. We don't want the schema to change out from under us without
+warning, so it's important that no XML be parsed that would result in
+a different schema than we had previously. Since we can
+pickle/unpickle everything already, this should be impossible.
+.P
+A list of supported document types is given in the appendix.
+.P
+The GameInfo and SportInfo types do not have their own top-level
+tables in the database. Instead, their raw XML is stored in either the
+\(dqgame_info\(dq or \(dqsport_info\(dq table respectively.
.SH DATABASE SCHEMA
.P
-At the top level, we have one table for each of the XML document types
-that we import. For example, the documents corresponding to
-\fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
+At the top level (with two notable exceptions), we have one table for
+each of the XML document types that we import. For example, the
+documents corresponding to \fInewsxml.dtd\fR will have a table called
+\(dqnews\(dq. All top-level tables contain two important fields,
+\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and
+prevents us from inserting the same data twice. The time stamp on the
+other hand lets us know when the data is old and can be removed. The
+database schema make it possible to delete only the outdated top-level
+records; all transient children should be removed by triggers.
.P
These top-level tables will often have children. For example, each
news item has zero or more locations associated with it. The child
table will be named _, which in this case
-corresponsds to \(dqnews_locations\(dq.
-.P
-To relate the two, a third table exists with name __. Note the two underscores. This prevents
-ambiguity when the child table itself contains underscores. As long we
-never go more than one level down, this system should suffice. The
-table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
-\(dqnews__news_locations\(dq.
-.P
-Wherever possible, children are kept unique to prevent pointless
-duplication. This slows down inserts, and speeds up reads (which we
-assume are much more frequent). The current rate at which the feed
-transmits XML is much too slow to cause problems inserting.
+corresponds to \(dqnews_locations\(dq.
+.P
+To relate the two, a third table may exist with name
+__. Note the two underscores. This prevents ambiguity
+when the child table itself contains underscores. The table joining
+\(dqnews\(dq with \(dqnews_locations\(dq is thus called
+\(dqnews__news_locations\(dq. This is necessary when the child table
+has a unique constraint; we don't want to blindly insert duplicate
+records keyed to the parent. Instead we'd like to use the third table
+to map an existing child to the new parent.
+.P
+Where it makes sense, children are kept unique to prevent pointless
+duplication. This slows down inserts, and speeds up reads (which are
+much more frequent). There is a tradeoff to be made, however. For a
+table with a small, fixed upper bound on the number of rows (like
+\(dqodds_casinos\(dq), there is great benefit to de-duplication. The
+total number of rows stays small, so inserts are still quick, and many
+duplicate rows are eliminated.
+.P
+But, with a table like \(dqodds_games\(dq, the number of games grows
+quickly and without bound. It is therefore more beneficial to be able
+to delete the old games (through an ON DELETE CASCADE, tied to
+\(dqodds\(dq) than it is to eliminate duplication. A table like
+\(dqnews_locations\(dq is somewhere in-between. It is hoped that the
+unique constraint in the top-level table's \(dqxml_file_id\(dq will
+prevent duplication in this case anyway.
+.P
+The aforementioned exceptions are the \(dqgame_info\(dq and
+\(dqsport_info\(dq tables. These tables contain the raw XML for a
+number of DTDs that are not handled individually. This is partially
+for backwards-compatibility with a legacy implementation, but is
+mostly a stopgap due to a lack of resources at the moment. These two
+tables (game_info and sport_info) still possess timestamps that allow
+us to prune old data.
+.P
+UML diagrams of the resulting database schema for each XML document
+type are provided with the \fBhtsn-import\fR documentation, in the
+\fIdoc/dbschema\fR directory. These are not authoritative, but it
+should be considered a bug if they are incorrect. The diagrams are
+created using the pgModeler tool.
+
+.SH NULL POLICY
+.P
+Normally in a database one makes a distinction between fields that
+simply don't exist, and those fields that are
+\(dqempty\(dq. Translating from XML, there is a natural way to
+determine which one should be used: if an element is present in the
+XML document but its contents are empty, then an empty string should
+be inserted into the corresponding field. If on the other hand the
+element is missing entirely, the corresponding database entry should
+be NULL to indicate that fact.
+.P
+This sounds well and good, but the XML must be consistent for the
+database consumer to make any sense of what he sees. The feed XML uses
+optional and blank elements interchangeably, and without any
+discernable pattern. To propagate this pattern into the database would
+only cause confusion.
+.P
+As a result, a policy was adopted: both optional elements and elements
+whose contents can be empty will be considered nullable in the
+database. If the element is missing, the corresponding field is
+NULL. Likewise if the content is simply missing. That means there
+should never be a (completely) empty string in a database column.
+
+.SH XML SCHEMA GENERATION
+.P
+In order to parse XML, you need to know the structure of your
+documents. Usually this is given in the form of a DTD or schema. The
+Sports Network does provide DTDs for their XML, but unfortunately many
+of them do not match the XML found on the feed.
+.P
+We need to construct a database into which to insert the XML. How do
+we know if should be a column, or if it should have its own
+table? We need to know how many times it can appear in the
+document. So we need some form of specification. Since the supplied
+DTDs are incorrect, we would like to generate them automatically.
+.P
+The process should go something like,
+.IP 1.
+Generate a DTD from the first foo.xml file we see. Call it foo.dtd.
+.IP 2.
+Validate future foo documents against foo.dtd. If they all validate,
+great. If one fails, add it to the corpus and update foo.dtd so
+that both the original and the new foo.xml validate.
+.IP 3.
+Repeat until no more failures occur. This can never be perfect:
+tomorrow we could get a foo.xml that's wildly different from what
+we've seen in the past. But it's the best we can hope for under
+the circumstances.
+.P
+Enter XML-Schema-learner
+. This tool can infer a
+DTD from a set of sample XML files. The top-level \(dqschemagen\(dq
+folder (in this project) contains a number of subfolders\(emone for
+each type of document that we want to parse. Contained therein are XML
+samples for that particular document type. These were hand-picked one
+at a time according to the procedure above, and the complete set of
+XML is what we use to generate the DTDs used by htsn-import.
+.P
+To generate them, run `make schema` at the project
+root. XML-Schema-learner will be invoked on each subfolder of
+\(dqschemagen\(dq and will output the corresponding DTDs to the
+\(dqschemagen\(dq folder.
+.P
+Most of the production schemas are generated this way; however, a few
+needed manual tweaking. The final, believed-to-be-correct schemas for
+all supported document types can be found in the \(dqschema\(dq folder in
+the project root. Having the correct DTDs available means you
+don't need XML-Schema-learner available to install \fBhtsn-import\fR.
+
+.SH XML SCHEMA UPDATES
+.P
+If a new tag is added to an XML document type, \fBhtsn-import\fR will
+most likely refuse to parse it, since the new documents no longer
+match the existing DTD.
+.P
+The first thing to do in that case is add the unparseable document to
+the \(dqschemagen\(dq directory, and generate a new DTD that matches
+both the old and new samples. Once a new, correct DTD has been
+generated, it should be added to the \(dqschema\(dq directory. Then,
+the parser can be updated and \fBhtsn-import\fR rebuilt.
+.P
+At this point, \fBhtsn-import\fR should be capable of importing the
+new document. But the addition of the new tag will most require new
+fields in the database. Fortunately, easy migrations like this are
+handled automatically. As an example, at one point, \fIOdds_XML.dtd\fR
+did not contain the \(dqHStarter\(dq and \(dqAStarter\(dq elements
+associated with its games. Suppose we parse one of the old documents
+(without \(dqHStarter\(dq and \(dqAStarter\(dq) using an old version
+of \fBhtsn-import\fR:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+Migration: CREATE TABLE \(dqodds\(dq ...
+Successfully imported schemagen/Odds_XML/19996433.xml.
+Processed 1 document(s) total.
+.fi
+.P
+At this point, the database schema matches the old documents, i.e. the
+ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+version of \fBhtsn-import\fR, supporting the new fields, the migration
+is handled gracefully:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/21315768.xml"
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_name\(dq VARCHAR;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_name\(dq VARCHAR;
+Successfully imported schemagen/Odds_XML/21315768.xml.
+Processed 1 document(s) total.
+.fi
+.P
+If fields are removed from the schema, then manual intervention may be
+necessary:
+.P
+.nf
+.I $ htsn-import -b Postgres -c 'dbname=htsn user=postgres' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+ERROR: Database migration: manual intervention required.
+The following actions are considered unsafe:
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_name\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_name\(dq
+
+ERROR: Failed to import file schemagen/Odds_XML/19996433.xml.
+Processed 0 document(s) total.
+.fi
+.P
+To fix these errors, manually invoke the SQL commands that were
+considered unsafe:
+.P
+.nf
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_name;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_name;'"
+ALTER TABLE
+.fi
+.P
+After manually adjusting the schema, the import should succeed.
+
+.SH XML SCHEMA ODDITIES
+.P
+There are a number of problems with the XML on the wire. Even if we
+construct the DTDs ourselves, the results are sometimes
+inconsistent. Here we document a few of them.
+
+.IP \[bu] 2
+\fInewsxml.dtd\fR
+
+The TSN DTD for news (and almost all XML on the wire) suggests that
+there is a exactly one (possibly-empty) element present in each
+message. However, we have seen an example (XML_File_ID 21232353) where
+an empty followed a non-empty one:
+
+.fi
+Odd Man Rush: Snow under pressure to improve Isles quickly
+
+.nf
+
+We don't parse this case at the moment, but we do recognize it and report
+it as unsupported so that offending documents can be removed. An example
+is provided as test/xml/newsxml-multiple-sms.xml.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
+
+The elements here are supposed to be associated with a set of
+ elements, but since the pair
+(......) can appear zero or more times,
+this leads to ambiguity in parsing. We therefore ignore the notes
+entirely (although a hack is employed to facilitate parsing). The same
+thing goes for the newer element.
+
+.IP \[bu]
+\fIweatherxml.dtd\fR
+
+There appear to be two types of weather documents; the first has
+ contained within and the second has
+contained within . While it would be possible to parse both,
+it would greatly complicate things. The first form is more common, so
+that's all we support for now. An example is provided as
+test/xml/weatherxml-type2.xml.
+
+We are however able to identify the second type. When one is
+encountered, an informational message (that it is unsupported) will be
+printed. If the \fI\-\-remove\fR flag is used, the file will be
+deleted. This prevents documents that we know we can't import from
+building up.
+
+Another problem that comes up occasionally is that the home and away
+team elements appear in the reverse order. As in the other case, we
+report these as unsupported and then \(dqsucceed\(dq so that the
+offending document can be removed if desired. An example is provided
+as test/xml/weatherxml-backwards-teams.xml.
+
+.SH DATE/TIME ISSUES
+
+Dates and times appear in a number of places on the feed. The date
+portions are usually, fine, but the times often lack important
+information such as the time zone, or whether \(dq8 o'clock\(dq means
+a.m. or p.m.
+
+The most pervasive issue occurs with the timestamps that are included
+in every message. A typical timestamp looks like,
+
+.nf
+ May 24, 2014, at 04:18 PM ET
+.fi
+
+The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
+\(dqET\(dq is not a valid time zone. It stands for \(dqEastern
+Time\(dq, which can belong to either of two time zones, EST or EDT,
+based on the time of the year (i.e. whether or not daylight savings
+time is in effect). Since we can't tell from the timestamp, we always
+parse these as EST which is UTC-5. When daylight savings is in effect,
+they will be off by an hour.
+
+Here's a list of the ones that may cause surprises:
+
+.IP \[bu] 2
+\fIAutoRacingResultsXML.dtd\fR
+
+The elements contain a full date and time, but no time zone
+information:
+
+.nf
+5/24/2014 2:45:00 PM
+.fi
+
+We parse them as UTC, which will be wrong when stored,
+but \(dqcorrect\(dq if the new UTC time zone is ignored.
+
+.IP \[bu]
+\fIAuto_Racing_Schedule_XML.dtd\fR
+
+The and elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+02/16/2013
+08:10 PM
+.fi
+
+As a result, we parse and store the times as UTC. The race times are
+not always present in the database, but when they are missing, they
+are presented as \(dqTBA\(dq (to be announced):
+
+.nf
+TBA
+.fi
+
+Since the dates do not appear to be optional, we store only the race
+date in that case.
+
+.IP \[bu]
+\fIearlylineXML.dtd\fR
+
+The